2025-05-07T20:23:26.0708077Z Current runner version: '2.323.0' 2025-05-07T20:23:26.0715107Z Runner name: 'i-02a13dec7b575dc8f' 2025-05-07T20:23:26.0716051Z Machine name: 'ip-10-0-35-243' 2025-05-07T20:23:26.0718743Z ##[group]GITHUB_TOKEN Permissions 2025-05-07T20:23:26.0720958Z Contents: read 2025-05-07T20:23:26.0721475Z Metadata: read 2025-05-07T20:23:26.0721968Z Packages: read 2025-05-07T20:23:26.0722460Z ##[endgroup] 2025-05-07T20:23:26.0724347Z Secret source: None 2025-05-07T20:23:26.0724962Z Prepare workflow directory 2025-05-07T20:23:26.1680499Z Prepare all required actions 2025-05-07T20:23:26.1719103Z Getting action download info 2025-05-07T20:23:26.3967547Z Download action repository 'actions/checkout@v4' (SHA:11bd71901bbe5b1630ceea73d27597364c9af683) 2025-05-07T20:23:26.6883205Z Download action repository 'actions/download-artifact@v4' (SHA:d3f86a106a0bac45b974a628896c90dbdf5c8093) 2025-05-07T20:23:27.0606919Z Download action repository 'pytorch/test-infra@main' (SHA:117fccdf5892ff9a958d2afb4b4b8b6e930d3187) 2025-05-07T20:23:28.7679890Z Getting action download info 2025-05-07T20:23:28.9051295Z Download action repository 'nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482' (SHA:3e91a01664abd3c5cd539100d10d33b9c5b68482) 2025-05-07T20:23:29.0985662Z Complete job name: test_and_publish_artifact (x86, linux.g5.4xlarge.nvidia.gpu, genai, 3.12, 12.6.3, 12.6.3, clang) 2025-05-07T20:23:29.1492644Z A job started hook has been configured by the self-hosted runner administrator 2025-05-07T20:23:29.1601828Z ##[group]Run '/home/ec2-user/runner-scripts/before_job.sh' 2025-05-07T20:23:29.1613210Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:23:29.1613946Z ##[endgroup] 2025-05-07T20:23:30.4260877Z Runner Type: linux.g5.4xlarge.nvidia.gpu 2025-05-07T20:23:30.4261307Z Instance Type: g5.4xlarge 2025-05-07T20:23:30.4261562Z AMI Name: unknown 2025-05-07T20:23:30.4299983Z AMI ID: ami-071226ecf16aa7d96 2025-05-07T20:23:35.8681726Z ##[group]Run actions/checkout@v4 2025-05-07T20:23:35.8682040Z with: 2025-05-07T20:23:35.8682286Z submodules: true 2025-05-07T20:23:35.8682538Z repository: pytorch/FBGEMM 2025-05-07T20:23:35.8682930Z token: *** 2025-05-07T20:23:35.8683149Z ssh-strict: true 2025-05-07T20:23:35.8683368Z ssh-user: git 2025-05-07T20:23:35.8683600Z persist-credentials: true 2025-05-07T20:23:35.8683857Z clean: true 2025-05-07T20:23:35.8684098Z sparse-checkout-cone-mode: true 2025-05-07T20:23:35.8684378Z fetch-depth: 1 2025-05-07T20:23:35.8684603Z fetch-tags: false 2025-05-07T20:23:35.8684830Z show-progress: true 2025-05-07T20:23:35.8685054Z lfs: false 2025-05-07T20:23:35.8685272Z set-safe-directory: true 2025-05-07T20:23:35.8685528Z env: 2025-05-07T20:23:35.8685756Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:23:35.8686088Z BUILD_ENV: build_binary 2025-05-07T20:23:35.8686375Z BUILD_TARGET: genai 2025-05-07T20:23:35.8686600Z BUILD_VARIANT: cuda 2025-05-07T20:23:35.8686873Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:23:35.8687130Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:23:35.8687366Z ##[endgroup] 2025-05-07T20:23:35.9869172Z Syncing repository: pytorch/FBGEMM 2025-05-07T20:23:35.9870388Z ##[group]Getting Git version info 2025-05-07T20:23:35.9870838Z Working directory is '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM' 2025-05-07T20:23:35.9871763Z [command]/usr/bin/git version 2025-05-07T20:23:35.9872137Z git version 2.47.1 2025-05-07T20:23:35.9885520Z ##[endgroup] 2025-05-07T20:23:35.9907497Z Temporarily overriding 
HOME='/home/ec2-user/actions-runner/_work/_temp/3b9d1976-eeb5-46b3-aa74-005056000165' before making global git config changes 2025-05-07T20:23:35.9908696Z Adding repository directory to the temporary git global config as a safe directory 2025-05-07T20:23:35.9912324Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM 2025-05-07T20:23:35.9954137Z [command]/usr/bin/git config --local --get remote.origin.url 2025-05-07T20:23:35.9978662Z https://github.com/pytorch/FBGEMM 2025-05-07T20:23:35.9996368Z ##[group]Removing previously created refs, to avoid conflicts 2025-05-07T20:23:36.0001243Z [command]/usr/bin/git rev-parse --symbolic-full-name --verify --quiet HEAD 2025-05-07T20:23:36.0028040Z refs/heads/main 2025-05-07T20:23:36.0036942Z [command]/usr/bin/git checkout --detach 2025-05-07T20:23:36.8887308Z HEAD is now at b6b2ce3 Migrate TBE forward kernels to `FBGEMM_LAUNCH_KERNEL` (#4079) 2025-05-07T20:23:36.8942136Z [command]/usr/bin/git branch --delete --force main 2025-05-07T20:23:36.8970312Z Deleted branch main (was b6b2ce3). 2025-05-07T20:23:36.8975878Z ##[endgroup] 2025-05-07T20:23:36.8979656Z [command]/usr/bin/git submodule status 2025-05-07T20:23:36.9403667Z e5d7c0bd5d9aec44d68830187138149e6a8c4e32 external/asmjit (e5d7c0b) 2025-05-07T20:23:36.9490582Z 4a61bdd4bd4ed730e078aebc7c0fcf046ff29406 external/composable_kernel (4a61bdd) 2025-05-07T20:23:36.9580595Z 6543fec09b2f04ac4a666882998b534afc9c1349 external/cpuinfo (6543fec) 2025-05-07T20:23:36.9668877Z 3ed8d2ec4ba35ef5d9d8353826209b6f868f63d3 external/cutlass (3ed8d2e) 2025-05-07T20:23:36.9756335Z f8d7d77c06936315286eb55f8de22cd23c188571 external/googletest (f8d7d77) 2025-05-07T20:23:36.9839795Z 420084499c7c1e1c2d801922f40df202eac5f3a0 external/hipify_torch (4200844) 2025-05-07T20:23:36.9921053Z 9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03 external/json (9cca280) 2025-05-07T20:23:36.9934641Z ##[group]Cleaning the repository 2025-05-07T20:23:36.9939698Z [command]/usr/bin/git clean -ffdx 2025-05-07T20:23:36.9996090Z [command]/usr/bin/git reset --hard HEAD 2025-05-07T20:23:37.0103725Z HEAD is now at b6b2ce3 Migrate TBE forward kernels to `FBGEMM_LAUNCH_KERNEL` (#4079) 2025-05-07T20:23:37.0110666Z ##[endgroup] 2025-05-07T20:23:37.0112311Z ##[group]Disabling automatic garbage collection 2025-05-07T20:23:37.0115653Z [command]/usr/bin/git config --local gc.auto 0 2025-05-07T20:23:37.0147899Z ##[endgroup] 2025-05-07T20:23:37.0148275Z ##[group]Setting up auth 2025-05-07T20:23:37.0153544Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand 2025-05-07T20:23:37.0184690Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :" 2025-05-07T20:23:37.0515494Z Entering 'external/asmjit' 2025-05-07T20:23:37.0589088Z Entering 'external/composable_kernel' 2025-05-07T20:23:37.0664095Z Entering 'external/cpuinfo' 2025-05-07T20:23:37.0731318Z Entering 'external/cutlass' 2025-05-07T20:23:37.0804065Z Entering 'external/googletest' 2025-05-07T20:23:37.0870791Z Entering 'external/hipify_torch' 2025-05-07T20:23:37.0936370Z Entering 'external/json' 2025-05-07T20:23:37.1026636Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader 2025-05-07T20:23:37.1060486Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config 
--local --unset-all 'http.https://github.com/.extraheader' || :" 2025-05-07T20:23:37.1396577Z Entering 'external/asmjit' 2025-05-07T20:23:37.1463962Z Entering 'external/composable_kernel' 2025-05-07T20:23:37.1537440Z Entering 'external/cpuinfo' 2025-05-07T20:23:37.1604637Z Entering 'external/cutlass' 2025-05-07T20:23:37.1682121Z Entering 'external/googletest' 2025-05-07T20:23:37.1749219Z Entering 'external/hipify_torch' 2025-05-07T20:23:37.1816714Z Entering 'external/json' 2025-05-07T20:23:37.1905471Z [command]/usr/bin/git config --local http.https://github.com/.extraheader AUTHORIZATION: basic *** 2025-05-07T20:23:37.1958614Z ##[endgroup] 2025-05-07T20:23:37.1959157Z ##[group]Fetching the repository 2025-05-07T20:23:37.1966097Z [command]/usr/bin/git -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules --depth=1 origin +a2f4c52051596e74bc8c16e3d2867a4ecdd271e0:refs/remotes/pull/4066/merge 2025-05-07T20:23:37.4260657Z From https://github.com/pytorch/FBGEMM 2025-05-07T20:23:37.4261321Z * [new ref] a2f4c52051596e74bc8c16e3d2867a4ecdd271e0 -> pull/4066/merge 2025-05-07T20:23:37.4287626Z ##[endgroup] 2025-05-07T20:23:37.4288103Z ##[group]Determining the checkout info 2025-05-07T20:23:37.4290050Z ##[endgroup] 2025-05-07T20:23:37.4295631Z [command]/usr/bin/git sparse-checkout disable 2025-05-07T20:23:37.4348031Z [command]/usr/bin/git config --local --unset-all extensions.worktreeConfig 2025-05-07T20:23:37.4377239Z ##[group]Checking out the ref 2025-05-07T20:23:37.4381890Z [command]/usr/bin/git checkout --progress --force refs/remotes/pull/4066/merge 2025-05-07T20:23:37.4509382Z Previous HEAD position was b6b2ce3 Migrate TBE forward kernels to `FBGEMM_LAUNCH_KERNEL` (#4079) 2025-05-07T20:23:37.4512516Z HEAD is now at a2f4c52 Merge 6060cd4b5f971680caecdcc657faccb5720d1c3e into fd4df5f456e0cca514bacd98a39efb72990fd9f4 2025-05-07T20:23:37.4521696Z ##[endgroup] 2025-05-07T20:23:37.4522094Z ##[group]Setting up auth for fetching submodules 2025-05-07T20:23:37.4528650Z [command]/usr/bin/git config --global http.https://github.com/.extraheader AUTHORIZATION: basic *** 2025-05-07T20:23:37.4579640Z [command]/usr/bin/git config --global --unset-all url.https://github.com/.insteadOf 2025-05-07T20:23:37.4609455Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf git@github.com: 2025-05-07T20:23:37.4640943Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf org-21003710@github.com: 2025-05-07T20:23:37.4669402Z ##[endgroup] 2025-05-07T20:23:37.4669945Z ##[group]Fetching submodules 2025-05-07T20:23:37.4673282Z [command]/usr/bin/git submodule sync 2025-05-07T20:23:37.5046934Z Synchronizing submodule url for 'external/asmjit' 2025-05-07T20:23:37.5047590Z Synchronizing submodule url for 'external/composable_kernel' 2025-05-07T20:23:37.5048137Z Synchronizing submodule url for 'external/cpuinfo' 2025-05-07T20:23:37.5048523Z Synchronizing submodule url for 'external/cutlass' 2025-05-07T20:23:37.5049207Z Synchronizing submodule url for 'external/googletest' 2025-05-07T20:23:37.5049633Z Synchronizing submodule url for 'external/hipify_torch' 2025-05-07T20:23:37.5050045Z Synchronizing submodule url for 'external/json' 2025-05-07T20:23:37.5062508Z [command]/usr/bin/git -c protocol.version=2 submodule update --init --force --depth=1 2025-05-07T20:23:37.5499652Z Submodule path 'external/asmjit': checked out 'e5d7c0bd5d9aec44d68830187138149e6a8c4e32' 2025-05-07T20:23:37.5651531Z Submodule path 'external/composable_kernel': checked out 
'4a61bdd4bd4ed730e078aebc7c0fcf046ff29406' 2025-05-07T20:23:37.5753490Z Submodule path 'external/cpuinfo': checked out '6543fec09b2f04ac4a666882998b534afc9c1349' 2025-05-07T20:23:37.5924764Z Submodule path 'external/cutlass': checked out '3ed8d2ec4ba35ef5d9d8353826209b6f868f63d3' 2025-05-07T20:23:37.6015003Z Submodule path 'external/googletest': checked out 'f8d7d77c06936315286eb55f8de22cd23c188571' 2025-05-07T20:23:37.6100993Z Submodule path 'external/hipify_torch': checked out '420084499c7c1e1c2d801922f40df202eac5f3a0' 2025-05-07T20:23:37.6206791Z Submodule path 'external/json': checked out '9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03' 2025-05-07T20:23:37.6224188Z [command]/usr/bin/git submodule foreach git config --local gc.auto 0 2025-05-07T20:23:37.6559732Z Entering 'external/asmjit' 2025-05-07T20:23:37.6591869Z Entering 'external/composable_kernel' 2025-05-07T20:23:37.6624198Z Entering 'external/cpuinfo' 2025-05-07T20:23:37.6656906Z Entering 'external/cutlass' 2025-05-07T20:23:37.6688609Z Entering 'external/googletest' 2025-05-07T20:23:37.6720422Z Entering 'external/hipify_torch' 2025-05-07T20:23:37.6752556Z Entering 'external/json' 2025-05-07T20:23:37.6796526Z ##[endgroup] 2025-05-07T20:23:37.6796956Z ##[group]Persisting credentials for submodules 2025-05-07T20:23:37.6802275Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'url\.https\:\/\/github\.com\/\.insteadOf' && git config --local --unset-all 'url.https://github.com/.insteadOf' || :" 2025-05-07T20:23:37.7132256Z Entering 'external/asmjit' 2025-05-07T20:23:37.7178633Z url.https://github.com/.insteadof 2025-05-07T20:23:37.7179307Z url.https://github.com/.insteadof 2025-05-07T20:23:37.7222597Z Entering 'external/composable_kernel' 2025-05-07T20:23:37.7266469Z url.https://github.com/.insteadof 2025-05-07T20:23:37.7266805Z url.https://github.com/.insteadof 2025-05-07T20:23:37.7316889Z Entering 'external/cpuinfo' 2025-05-07T20:23:37.7361397Z url.https://github.com/.insteadof 2025-05-07T20:23:37.7361765Z url.https://github.com/.insteadof 2025-05-07T20:23:37.7403964Z Entering 'external/cutlass' 2025-05-07T20:23:37.7447197Z url.https://github.com/.insteadof 2025-05-07T20:23:37.7447659Z url.https://github.com/.insteadof 2025-05-07T20:23:37.7498771Z Entering 'external/googletest' 2025-05-07T20:23:37.7541981Z url.https://github.com/.insteadof 2025-05-07T20:23:37.7542329Z url.https://github.com/.insteadof 2025-05-07T20:23:37.7584389Z Entering 'external/hipify_torch' 2025-05-07T20:23:37.7627051Z url.https://github.com/.insteadof 2025-05-07T20:23:37.7627501Z url.https://github.com/.insteadof 2025-05-07T20:23:37.7671140Z Entering 'external/json' 2025-05-07T20:23:37.7713567Z url.https://github.com/.insteadof 2025-05-07T20:23:37.7714032Z url.https://github.com/.insteadof 2025-05-07T20:23:37.7774499Z [command]/usr/bin/git submodule foreach sh -c "git config --local 'http.https://github.com/.extraheader' 'AUTHORIZATION: basic ***' && git config --local --show-origin --name-only --get-regexp remote.origin.url" 2025-05-07T20:23:37.8109913Z Entering 'external/asmjit' 2025-05-07T20:23:37.8172936Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/asmjit/config remote.origin.url 2025-05-07T20:23:37.8175945Z Entering 'external/composable_kernel' 2025-05-07T20:23:37.8238572Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/composable_kernel/config remote.origin.url 2025-05-07T20:23:37.8241143Z Entering 'external/cpuinfo' 2025-05-07T20:23:37.8302654Z 
file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/cpuinfo/config remote.origin.url 2025-05-07T20:23:37.8304996Z Entering 'external/cutlass' 2025-05-07T20:23:37.8368667Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/cutlass/config remote.origin.url 2025-05-07T20:23:37.8370596Z Entering 'external/googletest' 2025-05-07T20:23:37.8431383Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/googletest/config remote.origin.url 2025-05-07T20:23:37.8434132Z Entering 'external/hipify_torch' 2025-05-07T20:23:37.8494373Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/hipify_torch/config remote.origin.url 2025-05-07T20:23:37.8496956Z Entering 'external/json' 2025-05-07T20:23:37.8561837Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/json/config remote.origin.url 2025-05-07T20:23:37.8688040Z [command]/usr/bin/git submodule foreach git config --local --add 'url.https://github.com/.insteadOf' 'git@github.com:' 2025-05-07T20:23:37.9022811Z Entering 'external/asmjit' 2025-05-07T20:23:37.9054966Z Entering 'external/composable_kernel' 2025-05-07T20:23:37.9089198Z Entering 'external/cpuinfo' 2025-05-07T20:23:37.9122100Z Entering 'external/cutlass' 2025-05-07T20:23:37.9155028Z Entering 'external/googletest' 2025-05-07T20:23:37.9187239Z Entering 'external/hipify_torch' 2025-05-07T20:23:37.9219687Z Entering 'external/json' 2025-05-07T20:23:37.9275972Z [command]/usr/bin/git submodule foreach git config --local --add 'url.https://github.com/.insteadOf' 'org-21003710@github.com:' 2025-05-07T20:23:37.9612497Z Entering 'external/asmjit' 2025-05-07T20:23:37.9644599Z Entering 'external/composable_kernel' 2025-05-07T20:23:37.9675794Z Entering 'external/cpuinfo' 2025-05-07T20:23:37.9707513Z Entering 'external/cutlass' 2025-05-07T20:23:37.9739341Z Entering 'external/googletest' 2025-05-07T20:23:37.9771631Z Entering 'external/hipify_torch' 2025-05-07T20:23:37.9803000Z Entering 'external/json' 2025-05-07T20:23:37.9847149Z ##[endgroup] 2025-05-07T20:23:37.9890288Z [command]/usr/bin/git log -1 --format=%H 2025-05-07T20:23:37.9916910Z a2f4c52051596e74bc8c16e3d2867a4ecdd271e0 2025-05-07T20:23:38.0114361Z ##[group]Run actions/download-artifact@v4 2025-05-07T20:23:38.0114686Z with: 2025-05-07T20:23:38.0114933Z name: fbgemm_genai_x86_clang_py3.12_cu12.6.3.whl 2025-05-07T20:23:38.0115262Z merge-multiple: false 2025-05-07T20:23:38.0115520Z repository: pytorch/FBGEMM 2025-05-07T20:23:38.0115790Z run-id: 14891846252 2025-05-07T20:23:38.0116035Z env: 2025-05-07T20:23:38.0116257Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:23:38.0116551Z BUILD_ENV: build_binary 2025-05-07T20:23:38.0116797Z BUILD_TARGET: genai 2025-05-07T20:23:38.0117021Z BUILD_VARIANT: cuda 2025-05-07T20:23:38.0117260Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:23:38.0117515Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:23:38.0117759Z ##[endgroup] 2025-05-07T20:23:38.2480269Z Downloading single artifact 2025-05-07T20:23:38.3481990Z Preparing to download the following artifacts: 2025-05-07T20:23:38.3483042Z - fbgemm_genai_x86_clang_py3.12_cu12.6.3.whl (ID: 3081363158, Size: 12541158, Expected Digest: sha256:373c809c973bf06d642bb3f64051fc1f783379222e7abf42eee25d1e313140af) 2025-05-07T20:23:38.4401503Z Redirecting to blob download url: 
https://productionresultssa4.blob.core.windows.net/actions-results/b81c1ade-b872-4473-afc9-b227c140a38f/workflow-job-run-3b1ce936-8478-5297-b5a2-3b87565d3f2f/artifacts/fad341bebf692e31111b4381039b81f54868bd1760453cbce0dfdec7454245cc.zip
2025-05-07T20:23:38.4402971Z Starting download of artifact to: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:23:38.5160975Z (node:65567) [DEP0005] DeprecationWarning: Buffer() is deprecated due to security and usability issues. Please use the Buffer.alloc(), Buffer.allocUnsafe(), or Buffer.from() methods instead.
2025-05-07T20:23:38.5161964Z (Use `node --trace-deprecation ...` to show where the warning was created)
2025-05-07T20:23:38.6979904Z SHA256 digest of downloaded artifact is 373c809c973bf06d642bb3f64051fc1f783379222e7abf42eee25d1e313140af
2025-05-07T20:23:38.6980507Z Artifact download completed successfully.
2025-05-07T20:23:38.6980874Z Total of 1 artifact(s) downloaded
2025-05-07T20:23:38.6986434Z Download artifact has finished successfully
2025-05-07T20:23:38.7241752Z ##[group]Run pytorch/test-infra/.github/actions/setup-nvidia@main
2025-05-07T20:23:38.7242149Z with:
2025-05-07T20:23:38.7242368Z   driver-version: 570.133.07
2025-05-07T20:23:38.7242617Z env:
2025-05-07T20:23:38.7242849Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:38.7243150Z   BUILD_ENV: build_binary
2025-05-07T20:23:38.7243392Z   BUILD_TARGET: genai
2025-05-07T20:23:38.7243624Z   BUILD_VARIANT: cuda
2025-05-07T20:23:38.7243856Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:23:38.7244113Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:38.7244351Z ##[endgroup]
2025-05-07T20:23:38.7341004Z ##[group]Run nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482
2025-05-07T20:23:38.7341412Z with:
2025-05-07T20:23:38.7341631Z   timeout_minutes: 10
2025-05-07T20:23:38.7341876Z   max_attempts: 3
2025-05-07T20:23:38.7366537Z   command:
    # Is it disgusting to have a full shell script here in this github action? Sure
    # But is it the best way to make it so that this action relies on nothing else? Absolutely
    set -eou pipefail

    DISTRIBUTION=$(. /etc/os-release;echo $ID$VERSION_ID)
    DRIVER_FN="NVIDIA-Linux-x86_64-${DRIVER_VERSION}.run"

    install_nvidia_docker2_amzn2() {
      (
        set -x
        # Needed for yum-config-manager
        sudo yum install -y yum-utils
        if [[ "${DISTRIBUTION}" == "amzn2023" ]] ; then
          YUM_REPO_URL="https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo"
        else
          # Amazon Linux 2
          YUM_REPO_URL="https://nvidia.github.io/nvidia-docker/${DISTRIBUTION}/nvidia-docker.repo"
        fi
        sudo yum-config-manager --add-repo "${YUM_REPO_URL}"
        sudo yum install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
        sudo systemctl restart docker
      )
    }

    install_nvidia_docker2_ubuntu20() {
      (
        set -x
        # Install the nvidia-docker2 package if not installed
        status="$(dpkg-query -W --showformat='${db:Status-Status}' nvidia-docker2 2>&1)"
        if [ ! $? = 0 ] || [ ! "$status" = installed ]; then
          sudo apt-get install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
          sudo systemctl restart docker
        fi
      )
    }

    pre_install_nvidia_driver_amzn2() {
      (
        # Purge any nvidia driver installed from RHEL repo
        sudo yum remove -y nvidia-driver-latest-dkms
      )
    }

    install_nvidia_driver_common() {
      (
        # Try to gather more information about the runner and its existing NVIDIA driver if any
        echo "Before installing NVIDIA driver"
        lspci
        lsmod
        modinfo nvidia || true

        HAS_NVIDIA_DRIVER=0
        # Check if NVIDIA driver has already been installed
        if [ -x "$(command -v nvidia-smi)" ]; then
          set +e
          # The driver exists; check its version next. Also check only the first GPU if there
          # is more than one, so that the same driver version is not printed over multiple lines
          INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)
          NVIDIA_SMI_STATUS=$?
          if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then
            echo "Failed to get NVIDIA driver version ($INSTALLED_DRIVER_VERSION). Continuing"
          elif [ "$INSTALLED_DRIVER_VERSION" != "$DRIVER_VERSION" ]; then
            echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has been installed, but we expect to have $DRIVER_VERSION instead. Continuing"
            # Turn off persistent mode so that the installation script can unload the kernel module
            sudo killall nvidia-persistenced || true
          else
            HAS_NVIDIA_DRIVER=1
            echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has already been installed. Skipping NVIDIA driver installation"
          fi
          set -e
        fi

        if [ "$HAS_NVIDIA_DRIVER" -eq 0 ]; then
          # CAUTION: this may need to be updated in the future
          if [ "${DISTRIBUTION}" != ubuntu20.04 ]; then
            sudo yum groupinstall -y "Development Tools"
            # ensure our kernel install is the same as our underlying kernel,
            # groupinstall "Development Tools" has a habit of mismatching kernel headers
            sudo yum install -y "kernel-devel-uname-r == $(uname -r)"
            sudo modprobe backlight
          fi
          sudo curl -fsL -o /tmp/nvidia_driver "https://s3.amazonaws.com/ossci-linux/nvidia_driver/$DRIVER_FN"

          set +e
          sudo /bin/bash /tmp/nvidia_driver -s --no-drm
          NVIDIA_INSTALLATION_STATUS=$?

          RESET_GPU=0
          if [ "$NVIDIA_INSTALLATION_STATUS" -ne 0 ]; then
            sudo cat /var/log/nvidia-installer.log
            # Failed to install NVIDIA driver, try to reset the GPU
            RESET_GPU=1
          elif [ -x "$(command -v nvidia-smi)" ]; then
            # Check again if nvidia-smi works even if the driver installation completes successfully
            INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)
            NVIDIA_SMI_STATUS=$?
            if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then
              RESET_GPU=1
            fi
          fi

          if [ "$RESET_GPU" -eq 1 ]; then
            NVIDIA_DEVICES=$(lspci -D | grep -i NVIDIA | cut -d' ' -f1)
            # The GPU can get stuck in a failure state if somehow the test crashes the GPU microcode. When this
            # happens, we'll try to reset all NVIDIA devices https://github.com/pytorch/pytorch/issues/88388
            for PCI_ID in $NVIDIA_DEVICES; do
              DEVICE_ENABLED=$(cat /sys/bus/pci/devices/$PCI_ID/enable)
              echo "Resetting $PCI_ID (enabled state: $DEVICE_ENABLED)"
              # This requires sudo permission of course
              echo "1" | sudo tee /sys/bus/pci/devices/$PCI_ID/reset
              sleep 1
            done
          fi

          sudo rm -fv /tmp/nvidia_driver
          set -e
        fi
      )
    }

    post_install_nvidia_driver_common() {
      (
        sudo modprobe nvidia || true
        echo "After installing NVIDIA driver"
        lspci
        lsmod
        modinfo nvidia || true
        (
          set +e
          nvidia-smi
          # NB: Annoyingly, the nvidia-smi command returns successfully with return code 0 even in
          # the case where the driver has already crashed, as it still can get the driver version
          # and some basic information like the bus ID. However, the rest of the information
          # would be missing (ERR!), for example:
          #
          # +-----------------------------------------------------------------------------+
          # | NVIDIA-SMI 525.89.02    Driver Version: 525.89.02    CUDA Version: 12.0     |
          # |-------------------------------+----------------------+----------------------+
          # | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
          # | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
          # |                               |                      |               MIG M. |
          # |===============================+======================+======================|
          # |   0  ERR!                 Off | 00000000:00:1E.0 Off |                 ERR! |
          # |ERR!  ERR! ERR!    ERR! / ERR! |   4184MiB / 23028MiB |     ERR!     Default |
          # |                               |                      |                 ERR! |
          # +-------------------------------+----------------------+----------------------+
          #
          # +-----------------------------------------------------------------------------+
          # | Processes:                                                                  |
          # |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
          # |        ID   ID                                                   Usage      |
          # |=============================================================================|
          # +-----------------------------------------------------------------------------+
          #
          # This should be reported as a failure instead, as it will guarantee to fail when
          # Docker tries to run with --gpus all
          #
          # So, the correct check here is to query one of the missing pieces of info, like the
          # GPU name, so that the command can fail accordingly
          nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
          NVIDIA_SMI_STATUS=$?
          # Allowable exit statuses for nvidia-smi, see: https://github.com/NVIDIA/gpu-operator/issues/285
          if [ "$NVIDIA_SMI_STATUS" -eq 0 ] || [ "$NVIDIA_SMI_STATUS" -eq 14 ]; then
            echo "INFO: Ignoring allowed status ${NVIDIA_SMI_STATUS}"
          else
            echo "ERROR: nvidia-smi exited with unresolved status ${NVIDIA_SMI_STATUS}"
            exit ${NVIDIA_SMI_STATUS}
          fi
          set -e
        )
      )
    }

    install_nvidia_driver_amzn2() {
      (
        set -x
        pre_install_nvidia_driver_amzn2
        install_nvidia_driver_common
        post_install_nvidia_driver_common
      )
    }

    install_nvidia_driver_ubuntu20() {
      (
        set -x
        install_nvidia_driver_common
        post_install_nvidia_driver_common
      )
    }

    echo "== Installing nvidia driver ${DRIVER_FN} =="
    case "${DISTRIBUTION}" in
      amzn*)
        install_nvidia_driver_amzn2
        ;;
      ubuntu20.04)
        install_nvidia_driver_ubuntu20
        ;;
      *)
        echo "ERROR: Unknown distribution ${DISTRIBUTION}"
        exit 1
        ;;
    esac

    # Install container toolkit based on distribution
    echo "== Installing nvidia container toolkit for ${DISTRIBUTION} =="
    case "${DISTRIBUTION}" in
      amzn*)
        install_nvidia_docker2_amzn2
        ;;
      ubuntu20.04)
        install_nvidia_docker2_ubuntu20
        ;;
      *)
        echo "ERROR: Unknown distribution ${DISTRIBUTION}"
        exit 1
        ;;
    esac

    echo "GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all" >> "${GITHUB_ENV}"

    # Fix https://github.com/NVIDIA/nvidia-docker/issues/1648 on runners with
    # more than one GPU. This just needs to be run once. The command fails
    # on subsequent runs and complains that the mode is already on, but that's
    # ok
    sudo nvidia-persistenced || true

    # This should show persistence mode ON
    nvidia-smi
2025-05-07T20:23:38.7391624Z   retry_wait_seconds: 10
2025-05-07T20:23:38.7391889Z   polling_interval_seconds: 1
2025-05-07T20:23:38.7392153Z   warning_on_retry: true
2025-05-07T20:23:38.7392406Z   continue_on_error: false
2025-05-07T20:23:38.7392649Z env:
2025-05-07T20:23:38.7392865Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:38.7393174Z   BUILD_ENV: build_binary
2025-05-07T20:23:38.7393429Z   BUILD_TARGET: genai
2025-05-07T20:23:38.7393652Z   BUILD_VARIANT: cuda
2025-05-07T20:23:38.7393900Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:23:38.7394164Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:38.7394406Z   DRIVER_VERSION: 570.133.07
2025-05-07T20:23:38.7412360Z ##[endgroup]
2025-05-07T20:23:38.8252485Z == Installing nvidia driver NVIDIA-Linux-x86_64-570.133.07.run ==
2025-05-07T20:23:38.8254172Z + pre_install_nvidia_driver_amzn2
2025-05-07T20:23:38.8254668Z + sudo yum remove -y nvidia-driver-latest-dkms
2025-05-07T20:23:39.1704686Z No match for argument: nvidia-driver-latest-dkms
2025-05-07T20:23:39.1705343Z No packages marked for removal.
2025-05-07T20:23:39.1769545Z Dependencies resolved.
2025-05-07T20:23:39.1779691Z Nothing to do.
2025-05-07T20:23:39.1780396Z Complete!
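The nick-fields/retry step above re-runs the whole installer command whenever an attempt fails or exceeds its timeout. A minimal bash sketch of that policy (max_attempts: 3, timeout_minutes: 10, retry_wait_seconds: 10), assuming GNU coreutils `timeout` is available; retry_with_timeout and the script path in the usage line are hypothetical, not part of this workflow:

    # Sketch of the retry policy that nick-fields/retry applies above
    # (retry_with_timeout is a hypothetical helper, not part of this workflow).
    retry_with_timeout() {
      local max_attempts=3 timeout_s=600 wait_s=10 attempt
      for attempt in $(seq 1 "$max_attempts"); do
        # GNU coreutils `timeout` kills the command once timeout_s elapses
        timeout "$timeout_s" bash -c "$1" && return 0
        echo "Attempt $attempt failed; retrying in $wait_s seconds" >&2
        if [ "$attempt" -lt "$max_attempts" ]; then
          sleep "$wait_s"
        fi
      done
      return 1
    }
    # Usage: retry_with_timeout 'bash /tmp/install_nvidia_driver.sh'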
2025-05-07T20:23:39.2611021Z + install_nvidia_driver_common 2025-05-07T20:23:39.2614880Z + echo 'Before installing NVIDIA driver' 2025-05-07T20:23:39.2615419Z + lspci 2025-05-07T20:23:39.2617414Z Before installing NVIDIA driver 2025-05-07T20:23:39.2796820Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] 2025-05-07T20:23:39.2798239Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II] 2025-05-07T20:23:39.2799274Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08) 2025-05-07T20:23:39.2800460Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111 2025-05-07T20:23:39.2801538Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller 2025-05-07T20:23:39.2802493Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA) 2025-05-07T20:23:39.2803375Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1) 2025-05-07T20:23:39.2804251Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller 2025-05-07T20:23:39.2804983Z + lsmod 2025-05-07T20:23:39.2851462Z Module Size Used by 2025-05-07T20:23:39.2852067Z xt_nat 16384 0 2025-05-07T20:23:39.2852590Z nvidia_modeset 1716224 0 2025-05-07T20:23:39.2853132Z video 65536 1 nvidia_modeset 2025-05-07T20:23:39.2853739Z wmi 36864 1 video 2025-05-07T20:23:39.2854271Z nvidia_uvm 1884160 0 2025-05-07T20:23:39.2855026Z nvidia 11583488 2 nvidia_uvm,nvidia_modeset 2025-05-07T20:23:39.2855669Z drm 602112 1 nvidia 2025-05-07T20:23:39.2856269Z drm_panel_orientation_quirks 32768 1 drm 2025-05-07T20:23:39.2856927Z backlight 24576 3 video,drm,nvidia_modeset 2025-05-07T20:23:39.2857318Z i2c_core 110592 2 nvidia,drm 2025-05-07T20:23:39.2857605Z veth 36864 0 2025-05-07T20:23:39.2857858Z xt_conntrack 16384 1 2025-05-07T20:23:39.2858111Z nft_chain_nat 16384 3 2025-05-07T20:23:39.2858371Z xt_MASQUERADE 20480 1 2025-05-07T20:23:39.2858680Z nf_nat 57344 3 xt_nat,nft_chain_nat,xt_MASQUERADE 2025-05-07T20:23:39.2859016Z nf_conntrack_netlink 57344 0 2025-05-07T20:23:39.2859644Z nf_conntrack 184320 5 xt_conntrack,nf_nat,xt_nat,nf_conntrack_netlink,xt_MASQUERADE 2025-05-07T20:23:39.2860103Z nf_defrag_ipv6 24576 1 nf_conntrack 2025-05-07T20:23:39.2860420Z nf_defrag_ipv4 16384 1 nf_conntrack 2025-05-07T20:23:39.2860707Z xfrm_user 57344 1 2025-05-07T20:23:39.2860974Z xfrm_algo 16384 1 xfrm_user 2025-05-07T20:23:39.2861265Z xt_addrtype 16384 2 2025-05-07T20:23:39.2861520Z nft_compat 20480 4 2025-05-07T20:23:39.2861824Z nf_tables 311296 57 nft_compat,nft_chain_nat 2025-05-07T20:23:39.2862241Z nfnetlink 20480 4 nft_compat,nf_conntrack_netlink,nf_tables 2025-05-07T20:23:39.2862618Z br_netfilter 36864 0 2025-05-07T20:23:39.2862898Z bridge 323584 1 br_netfilter 2025-05-07T20:23:39.2863199Z stp 16384 1 bridge 2025-05-07T20:23:39.2863490Z llc 16384 2 bridge,stp 2025-05-07T20:23:39.2863776Z overlay 167936 0 2025-05-07T20:23:39.2864029Z tls 135168 0 2025-05-07T20:23:39.2864283Z nls_ascii 16384 1 2025-05-07T20:23:39.2864529Z nls_cp437 20480 1 2025-05-07T20:23:39.2864778Z vfat 24576 1 2025-05-07T20:23:39.2865035Z fat 86016 1 vfat 2025-05-07T20:23:39.2865304Z ena 180224 0 2025-05-07T20:23:39.2865550Z i8042 45056 0 2025-05-07T20:23:39.2865802Z serio 28672 3 i8042 2025-05-07T20:23:39.2866066Z button 24576 0 2025-05-07T20:23:39.2866323Z ghash_clmulni_intel 16384 0 2025-05-07T20:23:39.2866583Z sunrpc 696320 1 2025-05-07T20:23:39.2866830Z sch_fq_codel 20480 17 2025-05-07T20:23:39.2867087Z dm_mod 188416 0 2025-05-07T20:23:39.2867331Z fuse 163840 1 
2025-05-07T20:23:39.2867568Z loop 36864 0 2025-05-07T20:23:39.2867816Z configfs 57344 1 2025-05-07T20:23:39.2868225Z dax 45056 1 dm_mod 2025-05-07T20:23:39.2868540Z dmi_sysfs 20480 0 2025-05-07T20:23:39.2868939Z crc32_pclmul 16384 0 2025-05-07T20:23:39.2869189Z crc32c_intel 24576 0 2025-05-07T20:23:39.2869439Z efivarfs 24576 1 2025-05-07T20:23:39.2869696Z + modinfo nvidia 2025-05-07T20:23:39.2872378Z filename: /lib/modules/6.1.130-139.222.amzn2023.x86_64/kernel/drivers/video/nvidia.ko 2025-05-07T20:23:39.2872871Z import_ns: DMA_BUF 2025-05-07T20:23:39.2873122Z alias: char-major-195-* 2025-05-07T20:23:39.2873392Z version: 570.133.07 2025-05-07T20:23:39.2873632Z supported: external 2025-05-07T20:23:39.2873880Z license: Dual MIT/GPL 2025-05-07T20:23:39.2874169Z firmware: nvidia/570.133.07/gsp_tu10x.bin 2025-05-07T20:23:39.2874504Z firmware: nvidia/570.133.07/gsp_ga10x.bin 2025-05-07T20:23:39.2874828Z srcversion: 49515739FD8F721A3F2F714 2025-05-07T20:23:39.2875161Z alias: pci:v000010DEd*sv*sd*bc06sc80i00* 2025-05-07T20:23:39.2875505Z alias: pci:v000010DEd*sv*sd*bc03sc02i00* 2025-05-07T20:23:39.2875855Z alias: pci:v000010DEd*sv*sd*bc03sc00i00* 2025-05-07T20:23:39.2876166Z depends: i2c-core,drm 2025-05-07T20:23:39.2876462Z retpoline: Y 2025-05-07T20:23:39.2876679Z name: nvidia 2025-05-07T20:23:39.2877039Z vermagic: 6.1.130-139.222.amzn2023.x86_64 SMP preempt mod_unload modversions 2025-05-07T20:23:39.2877524Z parm: NvSwitchRegDwords:NvSwitch regkey (charp) 2025-05-07T20:23:39.2877969Z parm: NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp) 2025-05-07T20:23:39.2878391Z parm: NVreg_ResmanDebugLevel:int 2025-05-07T20:23:39.2878702Z parm: NVreg_RmLogonRC:int 2025-05-07T20:23:39.2878996Z parm: NVreg_ModifyDeviceFiles:int 2025-05-07T20:23:39.2879314Z parm: NVreg_DeviceFileUID:int 2025-05-07T20:23:39.2879619Z parm: NVreg_DeviceFileGID:int 2025-05-07T20:23:39.2880031Z parm: NVreg_DeviceFileMode:int 2025-05-07T20:23:39.2880396Z parm: NVreg_InitializeSystemMemoryAllocations:int 2025-05-07T20:23:39.2880785Z parm: NVreg_UsePageAttributeTable:int 2025-05-07T20:23:39.2881120Z parm: NVreg_EnablePCIeGen3:int 2025-05-07T20:23:39.2881415Z parm: NVreg_EnableMSI:int 2025-05-07T20:23:39.2881728Z parm: NVreg_EnableStreamMemOPs:int 2025-05-07T20:23:39.2882090Z parm: NVreg_RestrictProfilingToAdminUsers:int 2025-05-07T20:23:39.2882488Z parm: NVreg_PreserveVideoMemoryAllocations:int 2025-05-07T20:23:39.2882869Z parm: NVreg_EnableS0ixPowerManagement:int 2025-05-07T20:23:39.2883284Z parm: NVreg_S0ixPowerManagementVideoMemoryThreshold:int 2025-05-07T20:23:39.2883684Z parm: NVreg_DynamicPowerManagement:int 2025-05-07T20:23:39.2884105Z parm: NVreg_DynamicPowerManagementVideoMemoryThreshold:int 2025-05-07T20:23:39.2884517Z parm: NVreg_EnableGpuFirmware:int 2025-05-07T20:23:39.2884860Z parm: NVreg_EnableGpuFirmwareLogs:int 2025-05-07T20:23:39.2885231Z parm: NVreg_OpenRmEnableUnsupportedGpus:int 2025-05-07T20:23:39.2885605Z parm: NVreg_EnableUserNUMAManagement:int 2025-05-07T20:23:39.2885947Z parm: NVreg_MemoryPoolSize:int 2025-05-07T20:23:39.2886261Z parm: NVreg_KMallocHeapMaxSize:int 2025-05-07T20:23:39.2886592Z parm: NVreg_VMallocHeapMaxSize:int 2025-05-07T20:23:39.2886916Z parm: NVreg_IgnoreMMIOCheck:int 2025-05-07T20:23:39.2887227Z parm: NVreg_NvLinkDisable:int 2025-05-07T20:23:39.2887574Z parm: NVreg_EnablePCIERelaxedOrderingMode:int 2025-05-07T20:23:39.2887938Z parm: NVreg_RegisterPCIDriver:int 2025-05-07T20:23:39.2888269Z parm: NVreg_EnableResizableBar:int 2025-05-07T20:23:39.2888595Z parm: 
NVreg_EnableDbgBreakpoint:int 2025-05-07T20:23:39.2888938Z parm: NVreg_EnableNonblockingOpen:int 2025-05-07T20:23:39.2889276Z parm: NVreg_RegistryDwords:charp 2025-05-07T20:23:39.2889615Z parm: NVreg_RegistryDwordsPerDevice:charp 2025-05-07T20:23:39.2890044Z parm: NVreg_RmMsg:charp 2025-05-07T20:23:39.2890335Z parm: NVreg_GpuBlacklist:charp 2025-05-07T20:23:39.2890652Z parm: NVreg_TemporaryFilePath:charp 2025-05-07T20:23:39.2890981Z parm: NVreg_ExcludedGpus:charp 2025-05-07T20:23:39.2891297Z parm: NVreg_DmaRemapPeerMmio:int 2025-05-07T20:23:39.2891623Z parm: NVreg_RmNvlinkBandwidth:charp 2025-05-07T20:23:39.2891985Z parm: NVreg_RmNvlinkBandwidthLinkCount:int 2025-05-07T20:23:39.2892343Z parm: NVreg_ImexChannelCount:int 2025-05-07T20:23:39.2892674Z parm: NVreg_CreateImexChannel0:int 2025-05-07T20:23:39.2893029Z parm: NVreg_GrdmaPciTopoCheckOverride:int 2025-05-07T20:23:39.2893378Z parm: rm_firmware_active:charp 2025-05-07T20:23:39.2893674Z + HAS_NVIDIA_DRIVER=0 2025-05-07T20:23:39.2893912Z ++ command -v nvidia-smi 2025-05-07T20:23:39.2894181Z + '[' -x /usr/bin/nvidia-smi ']' 2025-05-07T20:23:39.2894546Z + set +e 2025-05-07T20:23:39.2894864Z ++ nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0 2025-05-07T20:23:40.9695526Z + INSTALLED_DRIVER_VERSION=570.133.07 2025-05-07T20:23:40.9695976Z + NVIDIA_SMI_STATUS=0 2025-05-07T20:23:40.9696300Z + '[' 0 -ne 0 ']' 2025-05-07T20:23:40.9696542Z + '[' 570.133.07 '!=' 570.133.07 ']' 2025-05-07T20:23:40.9696806Z + HAS_NVIDIA_DRIVER=1 2025-05-07T20:23:40.9697382Z + echo 'NVIDIA driver (570.133.07) has already been installed. Skipping NVIDIA driver installation' 2025-05-07T20:23:40.9698054Z + set -e 2025-05-07T20:23:40.9698316Z + '[' 1 -eq 0 ']' 2025-05-07T20:23:40.9698702Z NVIDIA driver (570.133.07) has already been installed. Skipping NVIDIA driver installation 2025-05-07T20:23:40.9699175Z + post_install_nvidia_driver_common 2025-05-07T20:23:40.9701467Z + sudo modprobe nvidia 2025-05-07T20:23:41.0724250Z + echo 'After installing NVIDIA driver' 2025-05-07T20:23:41.0725299Z + lspci 2025-05-07T20:23:41.0725772Z After installing NVIDIA driver 2025-05-07T20:23:41.0843138Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] 2025-05-07T20:23:41.0843659Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II] 2025-05-07T20:23:41.0844218Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08) 2025-05-07T20:23:41.0844749Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111 2025-05-07T20:23:41.0845229Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller 2025-05-07T20:23:41.0845765Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA) 2025-05-07T20:23:41.0846256Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1) 2025-05-07T20:23:41.0846738Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. 
NVMe SSD Controller 2025-05-07T20:23:41.0847139Z + lsmod 2025-05-07T20:23:41.0875546Z Module Size Used by 2025-05-07T20:23:41.0875880Z xt_nat 16384 0 2025-05-07T20:23:41.0876150Z nvidia_modeset 1716224 0 2025-05-07T20:23:41.0876433Z video 65536 1 nvidia_modeset 2025-05-07T20:23:41.0876739Z wmi 36864 1 video 2025-05-07T20:23:41.0877016Z nvidia_uvm 1884160 0 2025-05-07T20:23:41.0877312Z nvidia 11583488 2 nvidia_uvm,nvidia_modeset 2025-05-07T20:23:41.0877644Z drm 602112 1 nvidia 2025-05-07T20:23:41.0877948Z drm_panel_orientation_quirks 32768 1 drm 2025-05-07T20:23:41.0878305Z backlight 24576 3 video,drm,nvidia_modeset 2025-05-07T20:23:41.0878656Z i2c_core 110592 2 nvidia,drm 2025-05-07T20:23:41.0878943Z veth 36864 0 2025-05-07T20:23:41.0879195Z xt_conntrack 16384 1 2025-05-07T20:23:41.0879456Z nft_chain_nat 16384 3 2025-05-07T20:23:41.0879718Z xt_MASQUERADE 20480 1 2025-05-07T20:23:41.0880030Z nf_nat 57344 3 xt_nat,nft_chain_nat,xt_MASQUERADE 2025-05-07T20:23:41.0880376Z nf_conntrack_netlink 57344 0 2025-05-07T20:23:41.0881026Z nf_conntrack 184320 5 xt_conntrack,nf_nat,xt_nat,nf_conntrack_netlink,xt_MASQUERADE 2025-05-07T20:23:41.0881498Z nf_defrag_ipv6 24576 1 nf_conntrack 2025-05-07T20:23:41.0881804Z nf_defrag_ipv4 16384 1 nf_conntrack 2025-05-07T20:23:41.0882102Z xfrm_user 57344 1 2025-05-07T20:23:41.0882371Z xfrm_algo 16384 1 xfrm_user 2025-05-07T20:23:41.0882654Z xt_addrtype 16384 2 2025-05-07T20:23:41.0882917Z nft_compat 20480 4 2025-05-07T20:23:41.0883227Z nf_tables 311296 57 nft_compat,nft_chain_nat 2025-05-07T20:23:41.0883646Z nfnetlink 20480 4 nft_compat,nf_conntrack_netlink,nf_tables 2025-05-07T20:23:41.0884015Z br_netfilter 36864 0 2025-05-07T20:23:41.0884298Z bridge 323584 1 br_netfilter 2025-05-07T20:23:41.0884603Z stp 16384 1 bridge 2025-05-07T20:23:41.0884889Z llc 16384 2 bridge,stp 2025-05-07T20:23:41.0885191Z overlay 167936 0 2025-05-07T20:23:41.0885461Z tls 135168 0 2025-05-07T20:23:41.0885712Z nls_ascii 16384 1 2025-05-07T20:23:41.0885987Z nls_cp437 20480 1 2025-05-07T20:23:41.0886245Z vfat 24576 1 2025-05-07T20:23:41.0886498Z fat 86016 1 vfat 2025-05-07T20:23:41.0886780Z ena 180224 0 2025-05-07T20:23:41.0887033Z i8042 45056 0 2025-05-07T20:23:41.0887284Z serio 28672 3 i8042 2025-05-07T20:23:41.0887573Z button 24576 0 2025-05-07T20:23:41.0887844Z ghash_clmulni_intel 16384 0 2025-05-07T20:23:41.0888119Z sunrpc 696320 1 2025-05-07T20:23:41.0888378Z sch_fq_codel 20480 17 2025-05-07T20:23:41.0888653Z dm_mod 188416 0 2025-05-07T20:23:41.0888915Z fuse 163840 1 2025-05-07T20:23:41.0889178Z loop 36864 0 2025-05-07T20:23:41.0889620Z configfs 57344 1 2025-05-07T20:23:41.0889893Z dax 45056 1 dm_mod 2025-05-07T20:23:41.0890179Z dmi_sysfs 20480 0 2025-05-07T20:23:41.0890445Z crc32_pclmul 16384 0 2025-05-07T20:23:41.0890709Z crc32c_intel 24576 0 2025-05-07T20:23:41.0890957Z efivarfs 24576 1 2025-05-07T20:23:41.0891201Z + modinfo nvidia 2025-05-07T20:23:41.0892160Z filename: /lib/modules/6.1.130-139.222.amzn2023.x86_64/kernel/drivers/video/nvidia.ko 2025-05-07T20:23:41.0892619Z import_ns: DMA_BUF 2025-05-07T20:23:41.0892859Z alias: char-major-195-* 2025-05-07T20:23:41.0893147Z version: 570.133.07 2025-05-07T20:23:41.0893394Z supported: external 2025-05-07T20:23:41.0893637Z license: Dual MIT/GPL 2025-05-07T20:23:41.0893921Z firmware: nvidia/570.133.07/gsp_tu10x.bin 2025-05-07T20:23:41.0894264Z firmware: nvidia/570.133.07/gsp_ga10x.bin 2025-05-07T20:23:41.0894736Z srcversion: 49515739FD8F721A3F2F714 2025-05-07T20:23:41.0895060Z alias: pci:v000010DEd*sv*sd*bc06sc80i00* 
2025-05-07T20:23:41.0895409Z alias: pci:v000010DEd*sv*sd*bc03sc02i00* 2025-05-07T20:23:41.0895752Z alias: pci:v000010DEd*sv*sd*bc03sc00i00* 2025-05-07T20:23:41.0896060Z depends: i2c-core,drm 2025-05-07T20:23:41.0896314Z retpoline: Y 2025-05-07T20:23:41.0896534Z name: nvidia 2025-05-07T20:23:41.0896889Z vermagic: 6.1.130-139.222.amzn2023.x86_64 SMP preempt mod_unload modversions 2025-05-07T20:23:41.0897416Z parm: NvSwitchRegDwords:NvSwitch regkey (charp) 2025-05-07T20:23:41.0897866Z parm: NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp) 2025-05-07T20:23:41.0898287Z parm: NVreg_ResmanDebugLevel:int 2025-05-07T20:23:41.0898592Z parm: NVreg_RmLogonRC:int 2025-05-07T20:23:41.0898901Z parm: NVreg_ModifyDeviceFiles:int 2025-05-07T20:23:41.0899220Z parm: NVreg_DeviceFileUID:int 2025-05-07T20:23:41.0899528Z parm: NVreg_DeviceFileGID:int 2025-05-07T20:23:41.0899835Z parm: NVreg_DeviceFileMode:int 2025-05-07T20:23:41.0900313Z parm: NVreg_InitializeSystemMemoryAllocations:int 2025-05-07T20:23:41.0900702Z parm: NVreg_UsePageAttributeTable:int 2025-05-07T20:23:41.0901043Z parm: NVreg_EnablePCIeGen3:int 2025-05-07T20:23:41.0901350Z parm: NVreg_EnableMSI:int 2025-05-07T20:23:41.0901651Z parm: NVreg_EnableStreamMemOPs:int 2025-05-07T20:23:41.0902020Z parm: NVreg_RestrictProfilingToAdminUsers:int 2025-05-07T20:23:41.0902428Z parm: NVreg_PreserveVideoMemoryAllocations:int 2025-05-07T20:23:41.0902823Z parm: NVreg_EnableS0ixPowerManagement:int 2025-05-07T20:23:41.0903237Z parm: NVreg_S0ixPowerManagementVideoMemoryThreshold:int 2025-05-07T20:23:41.0903647Z parm: NVreg_DynamicPowerManagement:int 2025-05-07T20:23:41.0904076Z parm: NVreg_DynamicPowerManagementVideoMemoryThreshold:int 2025-05-07T20:23:41.0904493Z parm: NVreg_EnableGpuFirmware:int 2025-05-07T20:23:41.0904837Z parm: NVreg_EnableGpuFirmwareLogs:int 2025-05-07T20:23:41.0905208Z parm: NVreg_OpenRmEnableUnsupportedGpus:int 2025-05-07T20:23:41.0905578Z parm: NVreg_EnableUserNUMAManagement:int 2025-05-07T20:23:41.0905922Z parm: NVreg_MemoryPoolSize:int 2025-05-07T20:23:41.0906247Z parm: NVreg_KMallocHeapMaxSize:int 2025-05-07T20:23:41.0906582Z parm: NVreg_VMallocHeapMaxSize:int 2025-05-07T20:23:41.0906898Z parm: NVreg_IgnoreMMIOCheck:int 2025-05-07T20:23:41.0907218Z parm: NVreg_NvLinkDisable:int 2025-05-07T20:23:41.0907574Z parm: NVreg_EnablePCIERelaxedOrderingMode:int 2025-05-07T20:23:41.0907938Z parm: NVreg_RegisterPCIDriver:int 2025-05-07T20:23:41.0908274Z parm: NVreg_EnableResizableBar:int 2025-05-07T20:23:41.0908627Z parm: NVreg_EnableDbgBreakpoint:int 2025-05-07T20:23:41.0908973Z parm: NVreg_EnableNonblockingOpen:int 2025-05-07T20:23:41.0909421Z parm: NVreg_RegistryDwords:charp 2025-05-07T20:23:41.0909780Z parm: NVreg_RegistryDwordsPerDevice:charp 2025-05-07T20:23:41.0910126Z parm: NVreg_RmMsg:charp 2025-05-07T20:23:41.0910412Z parm: NVreg_GpuBlacklist:charp 2025-05-07T20:23:41.0910748Z parm: NVreg_TemporaryFilePath:charp 2025-05-07T20:23:41.0911085Z parm: NVreg_ExcludedGpus:charp 2025-05-07T20:23:41.0911403Z parm: NVreg_DmaRemapPeerMmio:int 2025-05-07T20:23:41.0911745Z parm: NVreg_RmNvlinkBandwidth:charp 2025-05-07T20:23:41.0912111Z parm: NVreg_RmNvlinkBandwidthLinkCount:int 2025-05-07T20:23:41.0912465Z parm: NVreg_ImexChannelCount:int 2025-05-07T20:23:41.0912803Z parm: NVreg_CreateImexChannel0:int 2025-05-07T20:23:41.0913168Z parm: NVreg_GrdmaPciTopoCheckOverride:int 2025-05-07T20:23:41.0913509Z parm: rm_firmware_active:charp 2025-05-07T20:23:41.0913803Z + set +e 2025-05-07T20:23:41.0914016Z + nvidia-smi 2025-05-07T20:23:42.5074936Z Wed 
May  7 20:23:42 2025
2025-05-07T20:23:42.5075383Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:42.5075894Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
2025-05-07T20:23:42.5076397Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:42.5076900Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:23:42.5077433Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:23:42.5077881Z |                                         |                        |               MIG M. |
2025-05-07T20:23:42.5078218Z |=========================================+========================+======================|
2025-05-07T20:23:42.5138935Z |   0  NVIDIA A10G                    Off |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:23:42.5139923Z |  0%   30C    P0             62W /  300W |       0MiB /  23028MiB |      4%      Default |
2025-05-07T20:23:42.5140365Z |                                         |                        |                  N/A |
2025-05-07T20:23:42.5140812Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:42.5141265Z 
2025-05-07T20:23:42.5141713Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:42.5142203Z | Processes:                                                                              |
2025-05-07T20:23:42.5142708Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2025-05-07T20:23:42.5143173Z |        ID   ID                                                               Usage      |
2025-05-07T20:23:42.5143571Z |=========================================================================================|
2025-05-07T20:23:42.5144066Z |  No running processes found                                                             |
2025-05-07T20:23:42.5144552Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:42.9201469Z + nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
2025-05-07T20:23:44.3326182Z NVIDIA A10G
2025-05-07T20:23:44.6034361Z + NVIDIA_SMI_STATUS=0
2025-05-07T20:23:44.6034723Z + '[' 0 -eq 0 ']'
2025-05-07T20:23:44.6035073Z + echo 'INFO: Ignoring allowed status 0'
2025-05-07T20:23:44.6035481Z + set -e
2025-05-07T20:23:44.6035778Z INFO: Ignoring allowed status 0
2025-05-07T20:23:44.6042785Z == Installing nvidia container toolkit for amzn2023 ==
2025-05-07T20:23:44.6047193Z + sudo yum install -y yum-utils
2025-05-07T20:23:45.0258286Z Last metadata expiration check: 0:17:46 ago on Wed May  7 20:05:59 2025.
2025-05-07T20:23:45.0507342Z Package dnf-utils-4.3.0-13.amzn2023.0.5.noarch is already installed.
2025-05-07T20:23:45.0909964Z Dependencies resolved.
2025-05-07T20:23:45.1092439Z Nothing to do.
2025-05-07T20:23:45.1092896Z Complete!
2025-05-07T20:23:45.1492622Z + [[ amzn2023 == \a\m\z\n\2\0\2\3 ]]
2025-05-07T20:23:45.1493200Z + YUM_REPO_URL=https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:45.1494093Z + sudo yum-config-manager --add-repo https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:45.5029162Z Adding repo from: https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:45.5583619Z + sudo yum install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
2025-05-07T20:23:46.1055726Z nvidia-container-toolkit                         14 kB/s | 833  B     00:00
2025-05-07T20:23:46.1305548Z Package nvidia-docker2-2.14.0-1.noarch is already installed.
2025-05-07T20:23:46.1712812Z Dependencies resolved.
2025-05-07T20:23:46.1895115Z ================================================================================
2025-05-07T20:23:46.1895585Z  Package                        Arch    Version   Repository                Size
2025-05-07T20:23:46.1895983Z ================================================================================
2025-05-07T20:23:46.1896315Z Downgrading:
2025-05-07T20:23:46.1896709Z  nvidia-container-toolkit       x86_64  1.16.2-1  nvidia-container-toolkit 1.2 M
2025-05-07T20:23:46.1897317Z  nvidia-container-toolkit-base  x86_64  1.16.2-1  nvidia-container-toolkit 5.6 M
2025-05-07T20:23:46.1897690Z 
2025-05-07T20:23:46.1897788Z Transaction Summary
2025-05-07T20:23:46.1898053Z ================================================================================
2025-05-07T20:23:46.1898373Z Downgrade  2 Packages
2025-05-07T20:23:46.1898528Z 
2025-05-07T20:23:46.1898632Z Total download size: 6.8 M
2025-05-07T20:23:46.1899099Z Downloading Packages:
2025-05-07T20:23:46.2587912Z (1/2): nvidia-container-toolkit-base-1.16.2-1.x  83 MB/s | 5.6 MB     00:00
2025-05-07T20:23:46.2670214Z (2/2): nvidia-container-toolkit-1.16.2-1.x86_64  16 MB/s | 1.2 MB     00:00
2025-05-07T20:23:46.2679675Z --------------------------------------------------------------------------------
2025-05-07T20:23:46.2682631Z Total                                            88 MB/s | 6.8 MB     00:00
2025-05-07T20:23:46.2685078Z Running transaction check
2025-05-07T20:23:46.2788687Z Transaction check succeeded.
2025-05-07T20:23:46.2789105Z Running transaction test
2025-05-07T20:23:46.3082101Z Transaction test succeeded.
2025-05-07T20:23:46.3084568Z Running transaction
2025-05-07T20:23:46.8572998Z   Preparing        :                                                        1/1
2025-05-07T20:23:46.9645361Z   Downgrading      : nvidia-container-toolkit-base-1.16.2-1.x86_64          1/4
2025-05-07T20:23:46.9682100Z   Downgrading      : nvidia-container-toolkit-1.16.2-1.x86_64               2/4
2025-05-07T20:23:46.9906532Z   Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64               2/4
2025-05-07T20:23:46.9907375Z   Cleanup          : nvidia-container-toolkit-1.17.6-1.x86_64               3/4
2025-05-07T20:23:47.0016140Z   Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64               3/4
2025-05-07T20:23:47.0046196Z   Cleanup          : nvidia-container-toolkit-base-1.17.6-1.x86_64          4/4
2025-05-07T20:23:47.1760339Z   Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64               4/4
2025-05-07T20:23:47.1761154Z   Verifying        : nvidia-container-toolkit-1.16.2-1.x86_64               1/4
2025-05-07T20:23:47.1761808Z   Verifying        : nvidia-container-toolkit-1.17.6-1.x86_64               2/4
2025-05-07T20:23:47.1762359Z   Verifying        : nvidia-container-toolkit-base-1.16.2-1.x86_64          3/4
2025-05-07T20:23:47.3068238Z   Verifying        : nvidia-container-toolkit-base-1.17.6-1.x86_64          4/4
================================================================================
2025-05-07T20:23:47.3069290Z WARNING:
2025-05-07T20:23:47.3069534Z   A newer release of "Amazon Linux" is available.
2025-05-07T20:23:47.3069784Z 
2025-05-07T20:23:47.3069881Z   Available Versions:
2025-05-07T20:23:47.3070027Z 
2025-05-07T20:23:47.3070117Z   Version 2023.7.20250331:
2025-05-07T20:23:47.3070433Z     Run the following command to upgrade to 2023.7.20250331:
2025-05-07T20:23:47.3070691Z 
2025-05-07T20:23:47.3070817Z       dnf upgrade --releasever=2023.7.20250331
2025-05-07T20:23:47.3071031Z 
2025-05-07T20:23:47.3071126Z     Release notes:
2025-05-07T20:23:47.3071537Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html
2025-05-07T20:23:47.3071926Z 
2025-05-07T20:23:47.3072020Z   Version 2023.7.20250414:
2025-05-07T20:23:47.3072328Z     Run the following command to upgrade to 2023.7.20250414:
2025-05-07T20:23:47.3072581Z 
2025-05-07T20:23:47.3072696Z       dnf upgrade --releasever=2023.7.20250414
2025-05-07T20:23:47.3072919Z 
2025-05-07T20:23:47.3073012Z     Release notes:
2025-05-07T20:23:47.3073426Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html
2025-05-07T20:23:47.3073806Z 
2025-05-07T20:23:47.3073902Z   Version 2023.7.20250428:
2025-05-07T20:23:47.3074209Z     Run the following command to upgrade to 2023.7.20250428:
2025-05-07T20:23:47.3074470Z 
2025-05-07T20:23:47.3074586Z       dnf upgrade --releasever=2023.7.20250428
2025-05-07T20:23:47.3074799Z 
2025-05-07T20:23:47.3074895Z     Release notes:
2025-05-07T20:23:47.3075293Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html
2025-05-07T20:23:47.3075676Z 
2025-05-07T20:23:47.3075787Z ================================================================================
2025-05-07T20:23:47.3431157Z 
2025-05-07T20:23:47.3431377Z 
2025-05-07T20:23:47.3431503Z Downgraded:
2025-05-07T20:23:47.3431892Z   nvidia-container-toolkit-1.16.2-1.x86_64
2025-05-07T20:23:47.3432477Z   nvidia-container-toolkit-base-1.16.2-1.x86_64
2025-05-07T20:23:47.3432861Z 
2025-05-07T20:23:47.3432951Z Complete!
2025-05-07T20:23:47.3903268Z + sudo systemctl restart docker
2025-05-07T20:23:52.2823744Z Wed May  7 20:23:52 2025
2025-05-07T20:23:52.2824265Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:52.2824869Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
2025-05-07T20:23:52.2825372Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:52.2826123Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:23:52.2826673Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:23:52.2827125Z |                                         |                        |               MIG M. |
2025-05-07T20:23:52.2827471Z |=========================================+========================+======================|
2025-05-07T20:23:52.2906970Z |   0  NVIDIA A10G                     On |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:23:52.2907595Z |  0%   30C    P0             62W /  300W |       0MiB /  23028MiB |      4%      Default |
2025-05-07T20:23:52.2908142Z |                                         |                        |                  N/A |
2025-05-07T20:23:52.2908695Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:52.2909200Z 
2025-05-07T20:23:52.2909597Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:52.2910041Z | Processes:                                                                              |
2025-05-07T20:23:52.2910498Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2025-05-07T20:23:52.2911267Z |        ID   ID                                                               Usage      |
2025-05-07T20:23:52.2911666Z |=========================================================================================|
2025-05-07T20:23:52.2912283Z |  No running processes found                                                             |
2025-05-07T20:23:52.2912948Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:52.8019215Z Command completed after 1 attempt(s).
2025-05-07T20:23:52.8111624Z ##[group]Run . $PRELUDE; print_system_info; print_ec2_info
2025-05-07T20:23:52.8112094Z . $PRELUDE; print_system_info; print_ec2_info
2025-05-07T20:23:52.8126943Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:23:52.8127317Z env:
2025-05-07T20:23:52.8127558Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:52.8127872Z   BUILD_ENV: build_binary
2025-05-07T20:23:52.8128131Z   BUILD_TARGET: genai
2025-05-07T20:23:52.8128385Z   BUILD_VARIANT: cuda
2025-05-07T20:23:52.8128625Z   BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:23:52.8128891Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:52.8129209Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:52.8129551Z ##[endgroup]
2025-05-07T20:23:53.1524368Z ################################################################################
2025-05-07T20:23:53.1524758Z # Print System Info
2025-05-07T20:23:53.1524983Z #
2025-05-07T20:23:53.1541284Z # [2025-05-07T20:23:53.153Z] + print_system_info
2025-05-07T20:23:53.1541659Z ################################################################################
2025-05-07T20:23:53.1541885Z 
2025-05-07T20:23:53.1541998Z ################################################################################
2025-05-07T20:23:53.1542339Z [INFO] Printing environment variables ...
2025-05-07T20:23:53.1542646Z + printenv 2025-05-07T20:23:53.1542763Z 2025-05-07T20:23:53.1563791Z SHELL=/bin/bash 2025-05-07T20:23:53.1564174Z GITHUB_WORKSPACE=/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM 2025-05-07T20:23:53.1564579Z BUILD_VARIANT=cuda 2025-05-07T20:23:53.1565117Z GITHUB_PATH=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_a588dc6e-a5c1-454e-a010-4d33cdea37e3 2025-05-07T20:23:53.1565710Z GITHUB_ACTION=__run 2025-05-07T20:23:53.1566001Z GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:23:53.1566341Z GITHUB_RUN_NUMBER=10601 2025-05-07T20:23:53.1566586Z RUNNER_NAME=i-02a13dec7b575dc8f 2025-05-07T20:23:53.1566878Z GITHUB_REPOSITORY_OWNER_ID=21003710 2025-05-07T20:23:53.1567185Z PLATFORM_NAME_LC=linux-x86_64 2025-05-07T20:23:53.1567513Z MACHINE_NAME_LC=x86_64 2025-05-07T20:23:53.1568036Z ACTIONS_RUNNER_HOOK_JOB_COMPLETED=/home/ec2-user/runner-scripts/after_job.sh 2025-05-07T20:23:53.1568642Z GITHUB_TRIGGERING_ACTOR=q10 2025-05-07T20:23:53.1568971Z PRELUDE=.github/scripts/setup_env.bash 2025-05-07T20:23:53.1569273Z GITHUB_REF_TYPE=branch 2025-05-07T20:23:53.1569738Z *** 2025-05-07T20:23:53.1569991Z LOGNAME=ec2-user 2025-05-07T20:23:53.1570232Z GITHUB_REPOSITORY_ID=150154628 2025-05-07T20:23:53.1570497Z ENFORCE_CUDA_DEVICE=1 2025-05-07T20:23:53.1570735Z GITHUB_ACTIONS=true 2025-05-07T20:23:53.1570962Z SYSTEMD_EXEC_PID=55382 2025-05-07T20:23:53.1571242Z GITHUB_SHA=a2f4c52051596e74bc8c16e3d2867a4ecdd271e0 2025-05-07T20:23:53.1571808Z GITHUB_WORKFLOW_REF=pytorch/FBGEMM/.github/workflows/fbgemm_gpu_ci_cuda.yml@refs/pull/4066/merge 2025-05-07T20:23:53.1572339Z RUNNER_ENVIRONMENT=self-hosted 2025-05-07T20:23:53.1572621Z GITHUB_REF=refs/pull/4066/merge 2025-05-07T20:23:53.1572887Z RUNNER_OS=Linux 2025-05-07T20:23:53.1573116Z GITHUB_REF_PROTECTED=false 2025-05-07T20:23:53.1573362Z HOME=/home/ec2-user 2025-05-07T20:23:53.1573628Z GITHUB_API_URL=https://api.github.com 2025-05-07T20:23:53.1573932Z LANG=C.UTF-8 2025-05-07T20:23:53.1574229Z RUNNER_TRACKING_ID=github_9125985d-0653-4ab0-94d0-9e9fb9cb14a2 2025-05-07T20:23:53.1574706Z RUNNER_ARCH=X64 2025-05-07T20:23:53.1574980Z RUNNER_TEMP=/home/ec2-user/actions-runner/_work/_temp 2025-05-07T20:23:53.1575637Z BUILD_TARGET=genai 2025-05-07T20:23:53.1576179Z GITHUB_STATE=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/save_state_a588dc6e-a5c1-454e-a010-4d33cdea37e3 2025-05-07T20:23:53.1577085Z GITHUB_ENV=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_env_a588dc6e-a5c1-454e-a010-4d33cdea37e3 2025-05-07T20:23:53.1577861Z GITHUB_EVENT_PATH=/home/ec2-user/actions-runner/_work/_temp/_github_workflow/event.json 2025-05-07T20:23:53.1578550Z INVOCATION_ID=b55d6cb2507b4fe896b3815e87d2f4e7 2025-05-07T20:23:53.1578892Z GITHUB_EVENT_NAME=pull_request 2025-05-07T20:23:53.1579167Z GITHUB_RUN_ID=14891846252 2025-05-07T20:23:53.1579772Z GITHUB_STEP_SUMMARY=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/step_summary_a588dc6e-a5c1-454e-a010-4d33cdea37e3 2025-05-07T20:23:53.1580415Z BUILD_ENV=build_binary 2025-05-07T20:23:53.1580656Z GITHUB_ACTOR=q10 2025-05-07T20:23:53.1580888Z GITHUB_RUN_ATTEMPT=1 2025-05-07T20:23:53.1581129Z KERN_NAME_LC=linux 2025-05-07T20:23:53.1581363Z BUILD_CUDA_VERSION=12.6.3 2025-05-07T20:23:53.1581675Z GITHUB_GRAPHQL_URL=https://api.github.com/graphql 2025-05-07T20:23:53.1582030Z PLATFORM_NAME=Linux-x86_64 2025-05-07T20:23:53.1582290Z USER=ec2-user 2025-05-07T20:23:53.1582534Z GITHUB_SERVER_URL=https://github.com 
2025-05-07T20:23:53.1582824Z SHLVL=1 2025-05-07T20:23:53.1583032Z GITHUB_ACTOR_ID=255046 2025-05-07T20:23:53.1583356Z RUNNER_TOOL_CACHE=/home/ec2-user/actions-runner/_work/_tool 2025-05-07T20:23:53.1583840Z GITHUB_WORKFLOW_SHA=6060cd4b5f971680caecdcc657faccb5720d1c3e 2025-05-07T20:23:53.1584281Z GITHUB_REF_NAME=4066/merge 2025-05-07T20:23:53.1584526Z KERN_NAME=Linux 2025-05-07T20:23:53.1584761Z GITHUB_JOB=test_and_publish_artifact 2025-05-07T20:23:53.1585171Z ACTIONS_RUNNER_HOOK_JOB_STARTED=/home/ec2-user/runner-scripts/before_job.sh 2025-05-07T20:23:53.1585607Z GITHUB_REPOSITORY=pytorch/FBGEMM 2025-05-07T20:23:53.1585883Z GITHUB_RETENTION_DAYS=90 2025-05-07T20:23:53.1586120Z JOURNAL_STREAM=8:83617 2025-05-07T20:23:53.1586437Z RUNNER_WORKSPACE=/home/ec2-user/actions-runner/_work/FBGEMM 2025-05-07T20:23:53.1586813Z GITHUB_ACTION_REPOSITORY= 2025-05-07T20:23:53.1587118Z PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin 2025-05-07T20:23:53.1587457Z GITHUB_BASE_REF=main 2025-05-07T20:23:53.1587674Z CI=true 2025-05-07T20:23:53.1587878Z GITHUB_REPOSITORY_OWNER=pytorch 2025-05-07T20:23:53.1588163Z GITHUB_HEAD_REF=bm/genai-rocm-oss-6 2025-05-07T20:23:53.1588449Z GITHUB_ACTION_REF= 2025-05-07T20:23:53.1588691Z GITHUB_WORKFLOW=FBGEMM GPU/GenAI CUDA CI 2025-05-07T20:23:53.1589317Z GITHUB_OUTPUT=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_output_a588dc6e-a5c1-454e-a010-4d33cdea37e3 2025-05-07T20:23:53.1589923Z MACHINE_NAME=x86_64 2025-05-07T20:23:53.1590142Z _=/usr/bin/printenv 2025-05-07T20:23:53.1590272Z 2025-05-07T20:23:53.1590386Z ################################################################################ 2025-05-07T20:23:53.1590708Z [INFO] Print ldd version ... 2025-05-07T20:23:53.1590974Z + ldd --version 2025-05-07T20:23:53.1591101Z 2025-05-07T20:23:53.1591198Z ldd (GNU libc) 2.34 2025-05-07T20:23:53.1591473Z Copyright (C) 2021 Free Software Foundation, Inc. 2025-05-07T20:23:53.1591934Z This is free software; see the source for copying conditions. There is NO 2025-05-07T20:23:53.1592491Z warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 2025-05-07T20:23:53.1592957Z Written by Roland McGrath and Ulrich Drepper. 2025-05-07T20:23:53.1593193Z 2025-05-07T20:23:53.1593312Z ################################################################################ 2025-05-07T20:23:53.1593635Z [INFO] Print CPU info ... 
2025-05-07T20:23:53.1593880Z + nproc 2025-05-07T20:23:53.1593997Z 2025-05-07T20:23:53.1611258Z 16 2025-05-07T20:23:53.1612892Z 2025-05-07T20:23:53.1613077Z + lscpu 2025-05-07T20:23:53.1613191Z 2025-05-07T20:23:53.1724611Z Architecture: x86_64 2025-05-07T20:23:53.1725374Z CPU op-mode(s): 32-bit, 64-bit 2025-05-07T20:23:53.1727105Z Address sizes: 48 bits physical, 48 bits virtual 2025-05-07T20:23:53.1727920Z Byte Order: Little Endian 2025-05-07T20:23:53.1728773Z CPU(s): 16 2025-05-07T20:23:53.1729431Z On-line CPU(s) list: 0-15 2025-05-07T20:23:53.1730067Z Vendor ID: AuthenticAMD 2025-05-07T20:23:53.1730473Z Model name: AMD EPYC 7R32 2025-05-07T20:23:53.1730814Z CPU family: 23 2025-05-07T20:23:53.1731440Z Model: 49 2025-05-07T20:23:53.1731748Z Thread(s) per core: 2 2025-05-07T20:23:53.1732049Z Core(s) per socket: 8 2025-05-07T20:23:53.1732344Z Socket(s): 1 2025-05-07T20:23:53.1732628Z Stepping: 0 2025-05-07T20:23:53.1732937Z BogoMIPS: 5599.29 2025-05-07T20:23:53.1735298Z Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:53.1737551Z Hypervisor vendor: KVM 2025-05-07T20:23:53.1737876Z Virtualization type: full 2025-05-07T20:23:53.1738268Z L1d cache: 256 KiB (8 instances) 2025-05-07T20:23:53.1738645Z L1i cache: 256 KiB (8 instances) 2025-05-07T20:23:53.1739020Z L2 cache: 4 MiB (8 instances) 2025-05-07T20:23:53.1739391Z L3 cache: 32 MiB (2 instances) 2025-05-07T20:23:53.1739725Z NUMA node(s): 1 2025-05-07T20:23:53.1740027Z NUMA node0 CPU(s): 0-15 2025-05-07T20:23:53.1740476Z Vulnerability Gather data sampling: Not affected 2025-05-07T20:23:53.1741024Z Vulnerability Itlb multihit: Not affected 2025-05-07T20:23:53.1741538Z Vulnerability L1tf: Not affected 2025-05-07T20:23:53.1742034Z Vulnerability Mds: Not affected 2025-05-07T20:23:53.1742539Z Vulnerability Meltdown: Not affected 2025-05-07T20:23:53.1743036Z Vulnerability Mmio stale data: Not affected 2025-05-07T20:23:53.1743540Z Vulnerability Reg file data sampling: Not affected 2025-05-07T20:23:53.1744097Z Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection 2025-05-07T20:23:53.1744687Z Vulnerability Spec rstack overflow: Mitigation; safe RET 2025-05-07T20:23:53.1745248Z Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl 2025-05-07T20:23:53.1745951Z Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization 2025-05-07T20:23:53.1746833Z Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected 2025-05-07T20:23:53.1747521Z Vulnerability Srbds: Not affected 2025-05-07T20:23:53.1747890Z Vulnerability Tsx async abort: Not affected 2025-05-07T20:23:53.1748208Z 2025-05-07T20:23:53.1748311Z + cat /proc/cpuinfo 2025-05-07T20:23:53.1748449Z 2025-05-07T20:23:53.1748538Z processor : 0 2025-05-07T20:23:53.1748751Z vendor_id : AuthenticAMD 2025-05-07T20:23:53.1748994Z cpu family : 23 2025-05-07T20:23:53.1749203Z model : 49 
2025-05-07T20:23:53.1749404Z model name : AMD EPYC 7R32 2025-05-07T20:23:53.1749651Z stepping : 0 2025-05-07T20:23:53.1749859Z microcode : 0x830107f 2025-05-07T20:23:53.1750196Z cpu MHz : 2897.609 2025-05-07T20:23:53.1750417Z cache size : 512 KB 2025-05-07T20:23:53.1750630Z physical id : 0 2025-05-07T20:23:53.1750830Z siblings : 16 2025-05-07T20:23:53.1751033Z core id : 0 2025-05-07T20:23:53.1751232Z cpu cores : 8 2025-05-07T20:23:53.1751430Z apicid : 0 2025-05-07T20:23:53.1751630Z initial apicid : 0 2025-05-07T20:23:53.1751838Z fpu : yes 2025-05-07T20:23:53.1752037Z fpu_exception : yes 2025-05-07T20:23:53.1752255Z cpuid level : 13 2025-05-07T20:23:53.1752459Z wp : yes 2025-05-07T20:23:53.1754653Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:53.1757060Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:53.1757571Z bogomips : 5599.29 2025-05-07T20:23:53.1757797Z TLB size : 3072 4K pages 2025-05-07T20:23:53.1758039Z clflush size : 64 2025-05-07T20:23:53.1758252Z cache_alignment : 64 2025-05-07T20:23:53.1758527Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:53.1758864Z power management: 2025-05-07T20:23:53.1758996Z 2025-05-07T20:23:53.1759082Z processor : 1 2025-05-07T20:23:53.1759303Z vendor_id : AuthenticAMD 2025-05-07T20:23:53.1759544Z cpu family : 23 2025-05-07T20:23:53.1759749Z model : 49 2025-05-07T20:23:53.1759962Z model name : AMD EPYC 7R32 2025-05-07T20:23:53.1760216Z stepping : 0 2025-05-07T20:23:53.1760420Z microcode : 0x830107f 2025-05-07T20:23:53.1760675Z cpu MHz : 2154.332 2025-05-07T20:23:53.1760917Z cache size : 512 KB 2025-05-07T20:23:53.1761134Z physical id : 0 2025-05-07T20:23:53.1761345Z siblings : 16 2025-05-07T20:23:53.1761549Z core id : 1 2025-05-07T20:23:53.1761750Z cpu cores : 8 2025-05-07T20:23:53.1761954Z apicid : 2 2025-05-07T20:23:53.1762159Z initial apicid : 2 2025-05-07T20:23:53.1762367Z fpu : yes 2025-05-07T20:23:53.1762571Z fpu_exception : yes 2025-05-07T20:23:53.1762791Z cpuid level : 13 2025-05-07T20:23:53.1762997Z wp : yes 2025-05-07T20:23:53.1765099Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:53.1767493Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:53.1768000Z bogomips : 5599.29 2025-05-07T20:23:53.1768224Z TLB size : 3072 4K pages 2025-05-07T20:23:53.1768460Z clflush size : 64 
2025-05-07T20:23:53.1815890Z cache_alignment : 64 2025-05-07T20:23:53.1816250Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:53.1816590Z power management: 2025-05-07T20:23:53.1816729Z 2025-05-07T20:23:53.1816837Z processor : 2 2025-05-07T20:23:53.1817061Z vendor_id : AuthenticAMD 2025-05-07T20:23:53.1817367Z cpu family : 23 2025-05-07T20:23:53.1817624Z model : 49 2025-05-07T20:23:53.1817848Z model name : AMD EPYC 7R32 2025-05-07T20:23:53.1818145Z stepping : 0 2025-05-07T20:23:53.1818437Z microcode : 0x830107f 2025-05-07T20:23:53.1818671Z cpu MHz : 2375.804 2025-05-07T20:23:53.1818898Z cache size : 512 KB 2025-05-07T20:23:53.1819125Z physical id : 0 2025-05-07T20:23:53.1819507Z siblings : 16 2025-05-07T20:23:53.1819716Z core id : 2 2025-05-07T20:23:53.1819920Z cpu cores : 8 2025-05-07T20:23:53.1820118Z apicid : 4 2025-05-07T20:23:53.1820320Z initial apicid : 4 2025-05-07T20:23:53.1820536Z fpu : yes 2025-05-07T20:23:53.1820730Z fpu_exception : yes 2025-05-07T20:23:53.1820949Z cpuid level : 13 2025-05-07T20:23:53.1821156Z wp : yes 2025-05-07T20:23:53.1823373Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:53.1826097Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:53.1826677Z bogomips : 5599.29 2025-05-07T20:23:53.1826918Z TLB size : 3072 4K pages 2025-05-07T20:23:53.1827172Z clflush size : 64 2025-05-07T20:23:53.1827400Z cache_alignment : 64 2025-05-07T20:23:53.1827700Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:53.1828055Z power management: 2025-05-07T20:23:53.1828199Z 2025-05-07T20:23:53.1828286Z processor : 3 2025-05-07T20:23:53.1828514Z vendor_id : AuthenticAMD 2025-05-07T20:23:53.1828776Z cpu family : 23 2025-05-07T20:23:53.1828988Z model : 49 2025-05-07T20:23:53.1829205Z model name : AMD EPYC 7R32 2025-05-07T20:23:53.1829463Z stepping : 0 2025-05-07T20:23:53.1829677Z microcode : 0x830107f 2025-05-07T20:23:53.1829917Z cpu MHz : 3289.591 2025-05-07T20:23:53.1830141Z cache size : 512 KB 2025-05-07T20:23:53.1830362Z physical id : 0 2025-05-07T20:23:53.1830583Z siblings : 16 2025-05-07T20:23:53.1830792Z core id : 3 2025-05-07T20:23:53.1830998Z cpu cores : 8 2025-05-07T20:23:53.1831209Z apicid : 6 2025-05-07T20:23:53.1831422Z initial apicid : 6 2025-05-07T20:23:53.1831642Z fpu : yes 2025-05-07T20:23:53.1831849Z fpu_exception : yes 2025-05-07T20:23:53.1832077Z cpuid level : 13 2025-05-07T20:23:53.1832299Z wp : yes 2025-05-07T20:23:53.1834752Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb 
sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:53.1837130Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:53.1837636Z bogomips : 5599.29 2025-05-07T20:23:53.1837856Z TLB size : 3072 4K pages 2025-05-07T20:23:53.1838081Z clflush size : 64 2025-05-07T20:23:53.1838297Z cache_alignment : 64 2025-05-07T20:23:53.1838567Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:53.1838882Z power management: 2025-05-07T20:23:53.1839017Z 2025-05-07T20:23:53.1839095Z processor : 4 2025-05-07T20:23:53.1839305Z vendor_id : AuthenticAMD 2025-05-07T20:23:53.1839541Z cpu family : 23 2025-05-07T20:23:53.1839735Z model : 49 2025-05-07T20:23:53.1839958Z model name : AMD EPYC 7R32 2025-05-07T20:23:53.1840234Z stepping : 0 2025-05-07T20:23:53.1840432Z microcode : 0x830107f 2025-05-07T20:23:53.1840653Z cpu MHz : 3299.649 2025-05-07T20:23:53.1840859Z cache size : 512 KB 2025-05-07T20:23:53.1841063Z physical id : 0 2025-05-07T20:23:53.1841273Z siblings : 16 2025-05-07T20:23:53.1841471Z core id : 4 2025-05-07T20:23:53.1841658Z cpu cores : 8 2025-05-07T20:23:53.1841854Z apicid : 8 2025-05-07T20:23:53.1842251Z initial apicid : 8 2025-05-07T20:23:53.1842457Z fpu : yes 2025-05-07T20:23:53.1842717Z fpu_exception : yes 2025-05-07T20:23:53.1842939Z cpuid level : 13 2025-05-07T20:23:53.1843133Z wp : yes 2025-05-07T20:23:53.1845348Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:53.1847734Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:53.1848239Z bogomips : 5599.29 2025-05-07T20:23:53.1848470Z TLB size : 3072 4K pages 2025-05-07T20:23:53.1848703Z clflush size : 64 2025-05-07T20:23:53.1848924Z cache_alignment : 64 2025-05-07T20:23:53.1849193Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:53.1849513Z power management: 2025-05-07T20:23:53.1849698Z 2025-05-07T20:23:53.1849810Z processor : 5 2025-05-07T20:23:53.1850075Z vendor_id : AuthenticAMD 2025-05-07T20:23:53.1850312Z cpu family : 23 2025-05-07T20:23:53.1850514Z model : 49 2025-05-07T20:23:53.1850723Z model name : AMD EPYC 7R32 2025-05-07T20:23:53.1850969Z stepping : 0 2025-05-07T20:23:53.1851183Z microcode : 0x830107f 2025-05-07T20:23:53.1851410Z cpu MHz : 3298.817 2025-05-07T20:23:53.1851631Z cache size : 512 KB 2025-05-07T20:23:53.1851847Z physical id : 0 2025-05-07T20:23:53.1852068Z siblings : 16 2025-05-07T20:23:53.1852269Z core id : 5 2025-05-07T20:23:53.1852469Z cpu cores : 8 2025-05-07T20:23:53.1852671Z apicid : 10 2025-05-07T20:23:53.1852874Z initial apicid : 10 2025-05-07T20:23:53.1853094Z fpu : yes 2025-05-07T20:23:53.1853304Z fpu_exception : yes 2025-05-07T20:23:53.1853518Z cpuid level : 13 2025-05-07T20:23:53.1853727Z wp : yes 2025-05-07T20:23:53.1855931Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx 
mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:53.1858333Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:53.1858844Z bogomips : 5599.29 2025-05-07T20:23:53.1859065Z TLB size : 3072 4K pages 2025-05-07T20:23:53.1859312Z clflush size : 64 2025-05-07T20:23:53.1859545Z cache_alignment : 64 2025-05-07T20:23:53.1859820Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:53.1860150Z power management: 2025-05-07T20:23:53.1860282Z 2025-05-07T20:23:53.1860381Z processor : 6 2025-05-07T20:23:53.1860668Z vendor_id : AuthenticAMD 2025-05-07T20:23:53.1860988Z cpu family : 23 2025-05-07T20:23:53.1861237Z model : 49 2025-05-07T20:23:53.1861432Z model name : AMD EPYC 7R32 2025-05-07T20:23:53.1861674Z stepping : 0 2025-05-07T20:23:53.1861873Z microcode : 0x830107f 2025-05-07T20:23:53.1862097Z cpu MHz : 1940.260 2025-05-07T20:23:53.1862304Z cache size : 512 KB 2025-05-07T20:23:53.1862511Z physical id : 0 2025-05-07T20:23:53.1862707Z siblings : 16 2025-05-07T20:23:53.1862900Z core id : 6 2025-05-07T20:23:53.1863099Z cpu cores : 8 2025-05-07T20:23:53.1863294Z apicid : 12 2025-05-07T20:23:53.1863480Z initial apicid : 12 2025-05-07T20:23:53.1863681Z fpu : yes 2025-05-07T20:23:53.1863877Z fpu_exception : yes 2025-05-07T20:23:53.1864186Z cpuid level : 13 2025-05-07T20:23:53.1864388Z wp : yes 2025-05-07T20:23:53.1866564Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:53.1869115Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:53.1869615Z bogomips : 5599.29 2025-05-07T20:23:53.1869848Z TLB size : 3072 4K pages 2025-05-07T20:23:53.1870132Z clflush size : 64 2025-05-07T20:23:53.1870344Z cache_alignment : 64 2025-05-07T20:23:53.1870624Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:53.1870947Z power management: 2025-05-07T20:23:53.1871077Z 2025-05-07T20:23:53.1871167Z processor : 7 2025-05-07T20:23:53.1871429Z vendor_id : AuthenticAMD 2025-05-07T20:23:53.1871665Z cpu family : 23 2025-05-07T20:23:53.1871871Z model : 49 2025-05-07T20:23:53.1872066Z model name : AMD EPYC 7R32 2025-05-07T20:23:53.1872304Z stepping : 0 2025-05-07T20:23:53.1872514Z microcode : 0x830107f 2025-05-07T20:23:53.1872728Z cpu MHz : 2583.007 2025-05-07T20:23:53.1872947Z cache size : 512 KB 2025-05-07T20:23:53.1873161Z physical id : 0 2025-05-07T20:23:53.1873366Z siblings : 16 2025-05-07T20:23:53.1873580Z core id : 7 2025-05-07T20:23:53.1873786Z cpu cores : 8 2025-05-07T20:23:53.1873982Z apicid : 
14 2025-05-07T20:23:53.1874183Z initial apicid : 14 2025-05-07T20:23:53.1874401Z fpu : yes 2025-05-07T20:23:53.1874590Z fpu_exception : yes 2025-05-07T20:23:53.1874810Z cpuid level : 13 2025-05-07T20:23:53.1875018Z wp : yes 2025-05-07T20:23:53.1877118Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:53.1879506Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:53.1880013Z bogomips : 5599.29 2025-05-07T20:23:53.1880231Z TLB size : 3072 4K pages 2025-05-07T20:23:53.1880469Z clflush size : 64 2025-05-07T20:23:53.1880678Z cache_alignment : 64 2025-05-07T20:23:53.1880942Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:53.1881263Z power management: 2025-05-07T20:23:53.1881394Z 2025-05-07T20:23:53.1881473Z processor : 8 2025-05-07T20:23:53.1881695Z vendor_id : AuthenticAMD 2025-05-07T20:23:53.1881929Z cpu family : 23 2025-05-07T20:23:53.1882132Z model : 49 2025-05-07T20:23:53.1882344Z model name : AMD EPYC 7R32 2025-05-07T20:23:53.1882593Z stepping : 0 2025-05-07T20:23:53.1882802Z microcode : 0x830107f 2025-05-07T20:23:53.1883037Z cpu MHz : 1977.170 2025-05-07T20:23:53.1883253Z cache size : 512 KB 2025-05-07T20:23:53.1883466Z physical id : 0 2025-05-07T20:23:53.1883682Z siblings : 16 2025-05-07T20:23:53.1883888Z core id : 0 2025-05-07T20:23:53.1884096Z cpu cores : 8 2025-05-07T20:23:53.1884293Z apicid : 1 2025-05-07T20:23:53.1884500Z initial apicid : 1 2025-05-07T20:23:53.1884716Z fpu : yes 2025-05-07T20:23:53.1884913Z fpu_exception : yes 2025-05-07T20:23:53.1885128Z cpuid level : 13 2025-05-07T20:23:53.1885341Z wp : yes 2025-05-07T20:23:53.1887439Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:53.1890069Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:53.1890580Z bogomips : 5599.29 2025-05-07T20:23:53.1890805Z TLB size : 3072 4K pages 2025-05-07T20:23:53.1891046Z clflush size : 64 2025-05-07T20:23:53.1891261Z cache_alignment : 64 2025-05-07T20:23:53.1891534Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:53.1891856Z power management: 2025-05-07T20:23:53.1891989Z 2025-05-07T20:23:53.1892081Z processor : 9 2025-05-07T20:23:53.1892299Z vendor_id : AuthenticAMD 2025-05-07T20:23:53.1892539Z cpu family : 23 2025-05-07T20:23:53.1892745Z model : 49 2025-05-07T20:23:53.1892954Z model name : AMD EPYC 7R32 2025-05-07T20:23:53.1893199Z 
stepping : 0 2025-05-07T20:23:53.1893405Z microcode : 0x830107f 2025-05-07T20:23:53.1893634Z cpu MHz : 3026.540 2025-05-07T20:23:53.1893850Z cache size : 512 KB 2025-05-07T20:23:53.1894064Z physical id : 0 2025-05-07T20:23:53.1894277Z siblings : 16 2025-05-07T20:23:53.1894581Z core id : 1 2025-05-07T20:23:53.1894797Z cpu cores : 8 2025-05-07T20:23:53.1895009Z apicid : 3 2025-05-07T20:23:53.1895206Z initial apicid : 3 2025-05-07T20:23:53.1895413Z fpu : yes 2025-05-07T20:23:53.1895613Z fpu_exception : yes 2025-05-07T20:23:53.1895831Z cpuid level : 13 2025-05-07T20:23:53.1896039Z wp : yes 2025-05-07T20:23:53.1898139Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:53.1900544Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:53.1901055Z bogomips : 5599.29 2025-05-07T20:23:53.1901276Z TLB size : 3072 4K pages 2025-05-07T20:23:53.1901509Z clflush size : 64 2025-05-07T20:23:53.1901726Z cache_alignment : 64 2025-05-07T20:23:53.1901999Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:53.1902313Z power management: 2025-05-07T20:23:53.1902452Z 2025-05-07T20:23:53.1902538Z processor : 10 2025-05-07T20:23:53.1902755Z vendor_id : AuthenticAMD 2025-05-07T20:23:53.1902992Z cpu family : 23 2025-05-07T20:23:53.1903200Z model : 49 2025-05-07T20:23:53.1903407Z model name : AMD EPYC 7R32 2025-05-07T20:23:53.1903643Z stepping : 0 2025-05-07T20:23:53.1903856Z microcode : 0x830107f 2025-05-07T20:23:53.1904086Z cpu MHz : 1832.408 2025-05-07T20:23:53.1904293Z cache size : 512 KB 2025-05-07T20:23:53.1904511Z physical id : 0 2025-05-07T20:23:53.1904726Z siblings : 16 2025-05-07T20:23:53.1904930Z core id : 2 2025-05-07T20:23:53.1905127Z cpu cores : 8 2025-05-07T20:23:53.1905330Z apicid : 5 2025-05-07T20:23:53.1905539Z initial apicid : 5 2025-05-07T20:23:53.1905746Z fpu : yes 2025-05-07T20:23:53.1905947Z fpu_exception : yes 2025-05-07T20:23:53.1906166Z cpuid level : 13 2025-05-07T20:23:53.1906371Z wp : yes 2025-05-07T20:23:53.1908464Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:53.1910935Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:53.1911435Z bogomips : 5599.29 2025-05-07T20:23:53.1911738Z TLB size : 3072 4K pages 2025-05-07T20:23:53.1911985Z clflush size : 64 2025-05-07T20:23:53.1912207Z cache_alignment : 64 2025-05-07T20:23:53.1912469Z address sizes : 48 bits 
physical, 48 bits virtual 2025-05-07T20:23:53.1912792Z power management: 2025-05-07T20:23:53.1912928Z 2025-05-07T20:23:53.1913011Z processor : 11 2025-05-07T20:23:53.1913226Z vendor_id : AuthenticAMD 2025-05-07T20:23:53.1913459Z cpu family : 23 2025-05-07T20:23:53.1913672Z model : 49 2025-05-07T20:23:53.1913877Z model name : AMD EPYC 7R32 2025-05-07T20:23:53.1914115Z stepping : 0 2025-05-07T20:23:53.1914324Z microcode : 0x830107f 2025-05-07T20:23:53.1914550Z cpu MHz : 3289.948 2025-05-07T20:23:53.1914761Z cache size : 512 KB 2025-05-07T20:23:53.1914977Z physical id : 0 2025-05-07T20:23:53.1915185Z siblings : 16 2025-05-07T20:23:53.1915382Z core id : 3 2025-05-07T20:23:53.1915589Z cpu cores : 8 2025-05-07T20:23:53.1915791Z apicid : 7 2025-05-07T20:23:53.1915985Z initial apicid : 7 2025-05-07T20:23:53.1916203Z fpu : yes 2025-05-07T20:23:53.1916406Z fpu_exception : yes 2025-05-07T20:23:53.1916622Z cpuid level : 13 2025-05-07T20:23:53.1916828Z wp : yes 2025-05-07T20:23:53.1918928Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:53.1921320Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:53.1921829Z bogomips : 5599.29 2025-05-07T20:23:53.1922048Z TLB size : 3072 4K pages 2025-05-07T20:23:53.1922291Z clflush size : 64 2025-05-07T20:23:53.1922514Z cache_alignment : 64 2025-05-07T20:23:53.1922784Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:53.1923113Z power management: 2025-05-07T20:23:53.1923248Z 2025-05-07T20:23:53.1923341Z processor : 12 2025-05-07T20:23:53.1923560Z vendor_id : AuthenticAMD 2025-05-07T20:23:53.1923808Z cpu family : 23 2025-05-07T20:23:53.1924022Z model : 49 2025-05-07T20:23:53.1924230Z model name : AMD EPYC 7R32 2025-05-07T20:23:53.1924486Z stepping : 0 2025-05-07T20:23:53.1924706Z microcode : 0x830107f 2025-05-07T20:23:53.1924933Z cpu MHz : 3300.128 2025-05-07T20:23:53.1925156Z cache size : 512 KB 2025-05-07T20:23:53.1925384Z physical id : 0 2025-05-07T20:23:53.1925870Z siblings : 16 2025-05-07T20:23:53.1926083Z core id : 4 2025-05-07T20:23:53.1926293Z cpu cores : 8 2025-05-07T20:23:53.1926499Z apicid : 9 2025-05-07T20:23:53.1926708Z initial apicid : 9 2025-05-07T20:23:53.1926929Z fpu : yes 2025-05-07T20:23:53.1927130Z fpu_exception : yes 2025-05-07T20:23:53.1927363Z cpuid level : 13 2025-05-07T20:23:53.1927581Z wp : yes 2025-05-07T20:23:53.1929687Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 
2025-05-07T20:23:53.1932294Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:53.1932800Z bogomips : 5599.29 2025-05-07T20:23:53.1933031Z TLB size : 3072 4K pages 2025-05-07T20:23:53.1933276Z clflush size : 64 2025-05-07T20:23:53.1933498Z cache_alignment : 64 2025-05-07T20:23:53.1933904Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:53.1934240Z power management: 2025-05-07T20:23:53.1934378Z 2025-05-07T20:23:53.1934562Z processor : 13 2025-05-07T20:23:53.1934785Z vendor_id : AuthenticAMD 2025-05-07T20:23:53.1935022Z cpu family : 23 2025-05-07T20:23:53.1935227Z model : 49 2025-05-07T20:23:53.1935438Z model name : AMD EPYC 7R32 2025-05-07T20:23:53.1935687Z stepping : 0 2025-05-07T20:23:53.1935897Z microcode : 0x830107f 2025-05-07T20:23:53.1936134Z cpu MHz : 3300.583 2025-05-07T20:23:53.1936352Z cache size : 512 KB 2025-05-07T20:23:53.1936569Z physical id : 0 2025-05-07T20:23:53.1936773Z siblings : 16 2025-05-07T20:23:53.1936972Z core id : 5 2025-05-07T20:23:53.1937170Z cpu cores : 8 2025-05-07T20:23:53.1937365Z apicid : 11 2025-05-07T20:23:53.1937575Z initial apicid : 11 2025-05-07T20:23:53.1937788Z fpu : yes 2025-05-07T20:23:53.1937987Z fpu_exception : yes 2025-05-07T20:23:53.1938201Z cpuid level : 13 2025-05-07T20:23:53.1938406Z wp : yes 2025-05-07T20:23:53.1940508Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:53.1942904Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:53.1943409Z bogomips : 5599.29 2025-05-07T20:23:53.1943645Z TLB size : 3072 4K pages 2025-05-07T20:23:53.1943879Z clflush size : 64 2025-05-07T20:23:53.1944099Z cache_alignment : 64 2025-05-07T20:23:53.1944371Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:53.1944694Z power management: 2025-05-07T20:23:53.1944832Z 2025-05-07T20:23:53.1944919Z processor : 14 2025-05-07T20:23:53.1945143Z vendor_id : AuthenticAMD 2025-05-07T20:23:53.1945384Z cpu family : 23 2025-05-07T20:23:53.1945589Z model : 49 2025-05-07T20:23:53.1945795Z model name : AMD EPYC 7R32 2025-05-07T20:23:53.1946044Z stepping : 0 2025-05-07T20:23:53.1946252Z microcode : 0x830107f 2025-05-07T20:23:53.1946486Z cpu MHz : 1789.141 2025-05-07T20:23:53.1946707Z cache size : 512 KB 2025-05-07T20:23:53.1946929Z physical id : 0 2025-05-07T20:23:53.1947140Z siblings : 16 2025-05-07T20:23:53.1947346Z core id : 6 2025-05-07T20:23:53.1947546Z cpu cores : 8 2025-05-07T20:23:53.1947751Z apicid : 13 2025-05-07T20:23:53.1947960Z initial apicid : 13 2025-05-07T20:23:53.1948170Z fpu : yes 2025-05-07T20:23:53.1948379Z fpu_exception : yes 2025-05-07T20:23:53.1948597Z cpuid level : 13 2025-05-07T20:23:53.1948802Z wp : yes 2025-05-07T20:23:53.1950918Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid 
extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:53.1955053Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:53.1955570Z bogomips : 5599.29 2025-05-07T20:23:53.1955801Z TLB size : 3072 4K pages 2025-05-07T20:23:53.1956034Z clflush size : 64 2025-05-07T20:23:53.1956256Z cache_alignment : 64 2025-05-07T20:23:53.1956530Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:53.1956851Z power management: 2025-05-07T20:23:53.1956991Z 2025-05-07T20:23:53.1957191Z processor : 15 2025-05-07T20:23:53.1957406Z vendor_id : AuthenticAMD 2025-05-07T20:23:53.1957639Z cpu family : 23 2025-05-07T20:23:53.1957843Z model : 49 2025-05-07T20:23:53.1958044Z model name : AMD EPYC 7R32 2025-05-07T20:23:53.1958276Z stepping : 0 2025-05-07T20:23:53.1958477Z microcode : 0x830107f 2025-05-07T20:23:53.1958697Z cpu MHz : 1758.835 2025-05-07T20:23:53.1958908Z cache size : 512 KB 2025-05-07T20:23:53.1959126Z physical id : 0 2025-05-07T20:23:53.1959349Z siblings : 16 2025-05-07T20:23:53.1959542Z core id : 7 2025-05-07T20:23:53.1959745Z cpu cores : 8 2025-05-07T20:23:53.1959979Z apicid : 15 2025-05-07T20:23:53.1960208Z initial apicid : 15 2025-05-07T20:23:53.1960433Z fpu : yes 2025-05-07T20:23:53.1960645Z fpu_exception : yes 2025-05-07T20:23:53.1960856Z cpuid level : 13 2025-05-07T20:23:53.1961071Z wp : yes 2025-05-07T20:23:53.1963181Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:53.1973149Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:53.1973661Z bogomips : 5599.29 2025-05-07T20:23:53.1973881Z TLB size : 3072 4K pages 2025-05-07T20:23:53.1974129Z clflush size : 64 2025-05-07T20:23:53.1974352Z cache_alignment : 64 2025-05-07T20:23:53.1974694Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:53.1975021Z power management: 2025-05-07T20:23:53.1975156Z 2025-05-07T20:23:53.1975161Z 2025-05-07T20:23:53.1975295Z ################################################################################ 2025-05-07T20:23:53.1975611Z [INFO] Print PCI info ... 2025-05-07T20:23:53.1975856Z + lspci -v 2025-05-07T20:23:53.1975970Z 2025-05-07T20:23:53.1976193Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] 2025-05-07T20:23:53.1976597Z Subsystem: Amazon.com, Inc. 
Device 1237 2025-05-07T20:23:53.1976935Z Flags: bus master, medium devsel, latency 0 2025-05-07T20:23:53.1977146Z 2025-05-07T20:23:53.1977355Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II] 2025-05-07T20:23:53.1977753Z Physical Slot: 1 2025-05-07T20:23:53.1978003Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:53.1978209Z 2025-05-07T20:23:53.1978461Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08) 2025-05-07T20:23:53.1978913Z Physical Slot: 1 2025-05-07T20:23:53.1979171Z Flags: bus master, fast devsel, latency 0, IRQ 9 2025-05-07T20:23:53.1979401Z 2025-05-07T20:23:53.1979683Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111 (prog-if 00 [VGA controller]) 2025-05-07T20:23:53.1980138Z Physical Slot: 3 2025-05-07T20:23:53.1980405Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:53.1980785Z Memory at c1000000 (32-bit, prefetchable) [size=4M] 2025-05-07T20:23:53.1981143Z Expansion ROM at 000c0000 [disabled] [size=128K] 2025-05-07T20:23:53.1981383Z 2025-05-07T20:23:53.1981694Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller (prog-if 02 [NVM Express]) 2025-05-07T20:23:53.1982353Z Subsystem: Amazon.com, Inc. Device 0000 2025-05-07T20:23:53.1982655Z Physical Slot: 4 2025-05-07T20:23:53.1982918Z Flags: bus master, fast devsel, latency 0, IRQ 11 2025-05-07T20:23:53.1983313Z Memory at c1808000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:53.1983687Z Capabilities: 2025-05-07T20:23:53.1983971Z Kernel driver in use: nvme 2025-05-07T20:23:53.1984143Z 2025-05-07T20:23:53.1984461Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA) 2025-05-07T20:23:53.1984961Z Subsystem: Amazon.com, Inc. Elastic Network Adapter (ENA) 2025-05-07T20:23:53.1985320Z Physical Slot: 5 2025-05-07T20:23:53.1985568Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:53.1985938Z Memory at c1804000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:53.1986336Z Memory at c1400000 (32-bit, prefetchable) [size=4M] 2025-05-07T20:23:53.1986668Z Capabilities: 2025-05-07T20:23:53.1986943Z Kernel driver in use: ena 2025-05-07T20:23:53.1987195Z Kernel modules: ena 2025-05-07T20:23:53.1987337Z 2025-05-07T20:23:53.1987509Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1) 2025-05-07T20:23:53.1987900Z Subsystem: NVIDIA Corporation Device 152f 2025-05-07T20:23:53.1988207Z Physical Slot: 30 2025-05-07T20:23:53.1988466Z Flags: bus master, fast devsel, latency 0, IRQ 10 2025-05-07T20:23:53.1988864Z Memory at c0000000 (32-bit, non-prefetchable) [size=16M] 2025-05-07T20:23:53.1989275Z Memory at 1800000000 (64-bit, prefetchable) [size=32G] 2025-05-07T20:23:53.1989667Z Memory at 1040000000 (64-bit, prefetchable) [size=32M] 2025-05-07T20:23:53.1990008Z Capabilities: 2025-05-07T20:23:53.1990288Z Kernel driver in use: nvidia 2025-05-07T20:23:53.1990593Z Kernel modules: nvidia 2025-05-07T20:23:53.1990748Z 2025-05-07T20:23:53.1991063Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller (prog-if 02 [NVM Express]) 2025-05-07T20:23:53.1991606Z Subsystem: Amazon.com, Inc. 
Device 0000 2025-05-07T20:23:53.1991907Z Physical Slot: 31 2025-05-07T20:23:53.1992149Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:53.1992515Z Memory at c1800000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:53.1992923Z Memory at c180c000 (32-bit, prefetchable) [size=8K] 2025-05-07T20:23:53.1993270Z Capabilities: 2025-05-07T20:23:53.1993539Z Kernel driver in use: nvme 2025-05-07T20:23:53.1993712Z 2025-05-07T20:23:53.1993716Z 2025-05-07T20:23:53.1993837Z ################################################################################ 2025-05-07T20:23:53.1994180Z [INFO] Print Linux distribution info ... 2025-05-07T20:23:53.1994472Z + uname -a 2025-05-07T20:23:53.1994602Z 2025-05-07T20:23:53.1995031Z Linux ip-10-0-35-243.ec2.internal 6.1.130-139.222.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Mar 11 01:10:58 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux 2025-05-07T20:23:53.1995562Z 2025-05-07T20:23:53.1995653Z + uname -m 2025-05-07T20:23:53.1995769Z 2025-05-07T20:23:53.1995850Z x86_64 2025-05-07T20:23:53.1995960Z 2025-05-07T20:23:53.1996048Z + cat /proc/version 2025-05-07T20:23:53.1996194Z 2025-05-07T20:23:53.1996768Z Linux version 6.1.130-139.222.amzn2023.x86_64 (mockbuild@ip-10-0-55-76) (gcc (GCC) 11.5.0 20240719 (Red Hat 11.5.0-5), GNU ld version 2.39-6.amzn2023.0.11) #1 SMP PREEMPT_DYNAMIC Tue Mar 11 01:10:58 UTC 2025 2025-05-07T20:23:53.1997436Z 2025-05-07T20:23:53.1997527Z + cat /etc/os-release 2025-05-07T20:23:53.1997676Z 2025-05-07T20:23:53.1997779Z NAME="Amazon Linux" 2025-05-07T20:23:53.1997992Z VERSION="2023" 2025-05-07T20:23:53.1998200Z ID="amzn" 2025-05-07T20:23:53.1998461Z ID_LIKE="fedora" 2025-05-07T20:23:53.1998672Z VERSION_ID="2023" 2025-05-07T20:23:53.1998912Z PLATFORM_ID="platform:al2023" 2025-05-07T20:23:53.1999197Z PRETTY_NAME="Amazon Linux 2023.6.20250317" 2025-05-07T20:23:53.1999494Z ANSI_COLOR="0;33" 2025-05-07T20:23:53.1999750Z CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2023" 2025-05-07T20:23:53.2000257Z HOME_URL="https://aws.amazon.com/linux/amazon-linux-2023/" 2025-05-07T20:23:53.2000703Z DOCUMENTATION_URL="https://docs.aws.amazon.com/linux/" 2025-05-07T20:23:53.2001139Z SUPPORT_URL="https://aws.amazon.com/premiumsupport/" 2025-05-07T20:23:53.2001601Z BUG_REPORT_URL="https://github.com/amazonlinux/amazon-linux-2023" 2025-05-07T20:23:53.2001987Z VENDOR_NAME="AWS" 2025-05-07T20:23:53.2002236Z VENDOR_URL="https://aws.amazon.com/" 2025-05-07T20:23:53.2002538Z SUPPORT_END="2029-06-30" 2025-05-07T20:23:53.2002700Z 2025-05-07T20:23:53.2002908Z ################################################################################ 2025-05-07T20:23:53.2003224Z # Print EC2 Instance Info 2025-05-07T20:23:53.2003473Z # 2025-05-07T20:23:53.2003692Z # [2025-05-07T20:23:53.193Z] + print_ec2_info 2025-05-07T20:23:53.2004008Z ################################################################################ 2025-05-07T20:23:53.2004236Z 2025-05-07T20:23:53.2066783Z ami-id: ami-071226ecf16aa7d96 2025-05-07T20:23:53.2189774Z instance-id: i-02a13dec7b575dc8f 2025-05-07T20:23:53.2303158Z instance-type: g5.4xlarge 2025-05-07T20:23:53.2340515Z ##[group]Run . $PRELUDE; print_gpu_info 2025-05-07T20:23:53.2340894Z . 
$PRELUDE; print_gpu_info 2025-05-07T20:23:53.2351356Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:23:53.2351726Z env: 2025-05-07T20:23:53.2351961Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:23:53.2352293Z BUILD_ENV: build_binary 2025-05-07T20:23:53.2352560Z BUILD_TARGET: genai 2025-05-07T20:23:53.2352804Z BUILD_VARIANT: cuda 2025-05-07T20:23:53.2353061Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:23:53.2353336Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:23:53.2353647Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:23:53.2354009Z ##[endgroup] 2025-05-07T20:23:53.5724299Z ################################################################################ 2025-05-07T20:23:53.5724804Z [INFO] Printing general display info ... 2025-05-07T20:23:53.5757207Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:23:53.6811986Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:23:53.6821090Z /usr/bin/sudo 2025-05-07T20:23:53.6832781Z which: no apt-get in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin) 2025-05-07T20:23:53.6843210Z /usr/bin/yum 2025-05-07T20:23:53.6844906Z [INSTALL] Updating system repositories ... 2025-05-07T20:23:53.6866067Z [EXEC] [ATTEMPT 0/3] + sudo yum update -y 2025-05-07T20:23:54.1547520Z Last metadata expiration check: 0:00:08 ago on Wed May 7 20:23:46 2025. 2025-05-07T20:23:54.2244383Z ================================================================================ 2025-05-07T20:23:54.2244751Z WARNING: 2025-05-07T20:23:54.2244989Z A newer release of "Amazon Linux" is available. 2025-05-07T20:23:54.2245233Z 2025-05-07T20:23:54.2245324Z Available Versions: 2025-05-07T20:23:54.2245468Z 2025-05-07T20:23:54.2245562Z Version 2023.7.20250331: 2025-05-07T20:23:54.2245867Z Run the following command to upgrade to 2023.7.20250331: 2025-05-07T20:23:54.2246163Z 2025-05-07T20:23:54.2246297Z dnf upgrade --releasever=2023.7.20250331 2025-05-07T20:23:54.2246517Z 2025-05-07T20:23:54.2246604Z Release notes: 2025-05-07T20:23:54.2247025Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html 2025-05-07T20:23:54.2247411Z 2025-05-07T20:23:54.2247502Z Version 2023.7.20250414: 2025-05-07T20:23:54.2247811Z Run the following command to upgrade to 2023.7.20250414: 2025-05-07T20:23:54.2248063Z 2025-05-07T20:23:54.2248187Z dnf upgrade --releasever=2023.7.20250414 2025-05-07T20:23:54.2248402Z 2025-05-07T20:23:54.2248492Z Release notes: 2025-05-07T20:23:54.2248885Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html 2025-05-07T20:23:54.2249265Z 2025-05-07T20:23:54.2249350Z Version 2023.7.20250428: 2025-05-07T20:23:54.2249656Z Run the following command to upgrade to 2023.7.20250428: 2025-05-07T20:23:54.2250169Z 2025-05-07T20:23:54.2250311Z dnf upgrade --releasever=2023.7.20250428 2025-05-07T20:23:54.2250562Z 2025-05-07T20:23:54.2250654Z Release notes: 2025-05-07T20:23:54.2251062Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html 2025-05-07T20:23:54.2251439Z 2025-05-07T20:23:54.2251565Z ================================================================================ 2025-05-07T20:23:54.3423473Z Dependencies resolved. 
2025-05-07T20:23:54.3712810Z ================================================================================ 2025-05-07T20:23:54.3713237Z Package Arch Version Repository Size 2025-05-07T20:23:54.3713638Z ================================================================================ 2025-05-07T20:23:54.3713960Z Upgrading: 2025-05-07T20:23:54.3714320Z nvidia-container-toolkit x86_64 1.17.6-1 nvidia-container-toolkit 1.2 M 2025-05-07T20:23:54.3714932Z nvidia-container-toolkit-base x86_64 1.17.6-1 nvidia-container-toolkit 5.7 M 2025-05-07T20:23:54.3715330Z 2025-05-07T20:23:54.3717181Z Transaction Summary 2025-05-07T20:23:54.3717462Z ================================================================================ 2025-05-07T20:23:54.3717777Z Upgrade 2 Packages 2025-05-07T20:23:54.3717925Z 2025-05-07T20:23:54.3718030Z Total download size: 6.9 M 2025-05-07T20:23:54.3718299Z Downloading Packages: 2025-05-07T20:23:54.4063221Z (1/2): nvidia-container-toolkit-1.17.6-1.x86_64 37 MB/s | 1.2 MB 00:00 2025-05-07T20:23:54.4648018Z (2/2): nvidia-container-toolkit-base-1.17.6-1.x 62 MB/s | 5.7 MB 00:00 2025-05-07T20:23:54.4655624Z -------------------------------------------------------------------------------- 2025-05-07T20:23:54.4658505Z Total 74 MB/s | 6.9 MB 00:00 2025-05-07T20:23:54.4660814Z Running transaction check 2025-05-07T20:23:54.4756208Z Transaction check succeeded. 2025-05-07T20:23:54.4756737Z Running transaction test 2025-05-07T20:23:54.5051197Z Transaction test succeeded. 2025-05-07T20:23:54.5054447Z Running transaction 2025-05-07T20:23:55.0567340Z Preparing : 1/1 2025-05-07T20:23:55.1621608Z Upgrading : nvidia-container-toolkit-base-1.17.6-1.x86_64 1/4 2025-05-07T20:23:55.1645224Z Upgrading : nvidia-container-toolkit-1.17.6-1.x86_64 2/4 2025-05-07T20:23:55.1841760Z Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64 2/4 2025-05-07T20:23:55.1842653Z Cleanup : nvidia-container-toolkit-1.16.2-1.x86_64 3/4 2025-05-07T20:23:55.1951050Z Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64 3/4 2025-05-07T20:23:55.1971981Z Cleanup : nvidia-container-toolkit-base-1.16.2-1.x86_64 4/4 2025-05-07T20:23:55.3421618Z Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64 4/4 2025-05-07T20:23:55.3422244Z Verifying : nvidia-container-toolkit-1.17.6-1.x86_64 1/4 2025-05-07T20:23:55.3422848Z Verifying : nvidia-container-toolkit-1.16.2-1.x86_64 2/4 2025-05-07T20:23:55.3423399Z Verifying : nvidia-container-toolkit-base-1.17.6-1.x86_64 3/4 2025-05-07T20:23:55.4803798Z ================================================================================ 2025-05-07T20:23:55.4804173Z WARNING: 2025-05-07T20:23:55.4804423Z A newer release of "Amazon Linux" is available. 
2025-05-07T20:23:55.4804658Z 2025-05-07T20:23:55.4804763Z Available Versions: 2025-05-07T20:23:55.4804913Z 2025-05-07T20:23:55.4805004Z Version 2023.7.20250331: 2025-05-07T20:23:55.4805325Z Run the following command to upgrade to 2023.7.20250331: 2025-05-07T20:23:55.4805584Z 2025-05-07T20:23:55.4805715Z dnf upgrade --releasever=2023.7.20250331 2025-05-07T20:23:55.4805929Z 2025-05-07T20:23:55.4806015Z Release notes: 2025-05-07T20:23:55.4806436Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html 2025-05-07T20:23:55.4807092Z 2025-05-07T20:23:55.4807198Z Version 2023.7.20250414: 2025-05-07T20:23:55.4807516Z Run the following command to upgrade to 2023.7.20250414: 2025-05-07T20:23:55.4807773Z 2025-05-07T20:23:55.4807894Z dnf upgrade --releasever=2023.7.20250414 2025-05-07T20:23:55.4808117Z 2025-05-07T20:23:55.4808203Z Release notes: 2025-05-07T20:23:55.4808613Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html 2025-05-07T20:23:55.4808995Z 2025-05-07T20:23:55.4809085Z Version 2023.7.20250428: 2025-05-07T20:23:55.4809405Z Run the following command to upgrade to 2023.7.20250428: 2025-05-07T20:23:55.4809669Z 2025-05-07T20:23:55.4809785Z dnf upgrade --releasever=2023.7.20250428 2025-05-07T20:23:55.4810001Z 2025-05-07T20:23:55.4810095Z Release notes: 2025-05-07T20:23:55.4810498Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html 2025-05-07T20:23:55.4810889Z 2025-05-07T20:23:55.4811201Z ================================================================================ 2025-05-07T20:23:55.5376535Z Verifying : nvidia-container-toolkit-base-1.16.2-1.x86_64 4/4 2025-05-07T20:23:55.5376880Z 2025-05-07T20:23:55.5376977Z Upgraded: 2025-05-07T20:23:55.5377317Z nvidia-container-toolkit-1.17.6-1.x86_64 2025-05-07T20:23:55.5377900Z nvidia-container-toolkit-base-1.17.6-1.x86_64 2025-05-07T20:23:55.5378248Z 2025-05-07T20:23:55.5378338Z Complete! 2025-05-07T20:23:55.5843223Z [INSTALL] Installing system package(s): hostname lshw ... 2025-05-07T20:23:55.5866806Z [EXEC] [ATTEMPT 0/3] + sudo yum install -y hostname lshw 2025-05-07T20:23:56.0437326Z Last metadata expiration check: 0:00:10 ago on Wed May 7 20:23:46 2025. 2025-05-07T20:23:56.0678033Z Package hostname-3.23-4.amzn2023.0.3.x86_64 is already installed. 2025-05-07T20:23:56.1091611Z Dependencies resolved. 
2025-05-07T20:23:56.1268961Z ================================================================================ 2025-05-07T20:23:56.1269433Z Package Architecture Version Repository Size 2025-05-07T20:23:56.1269863Z ================================================================================ 2025-05-07T20:23:56.1270171Z Installing: 2025-05-07T20:23:56.1270474Z lshw x86_64 B.02.19.2-7.amzn2023.0.3 amazonlinux 319 k 2025-05-07T20:23:56.1270765Z 2025-05-07T20:23:56.1270864Z Transaction Summary 2025-05-07T20:23:56.1271134Z ================================================================================ 2025-05-07T20:23:56.1271444Z Install 1 Package 2025-05-07T20:23:56.1271581Z 2025-05-07T20:23:56.1271702Z Total download size: 319 k 2025-05-07T20:23:56.1272310Z Installed size: 837 k 2025-05-07T20:23:56.1273733Z Downloading Packages: 2025-05-07T20:23:56.2043522Z lshw-B.02.19.2-7.amzn2023.0.3.x86_64.rpm 6.6 MB/s | 319 kB 00:00 2025-05-07T20:23:56.2049045Z -------------------------------------------------------------------------------- 2025-05-07T20:23:56.2051857Z Total 4.0 MB/s | 319 kB 00:00 2025-05-07T20:23:56.2210505Z Running transaction check 2025-05-07T20:23:56.2264554Z Transaction check succeeded. 2025-05-07T20:23:56.2264958Z Running transaction test 2025-05-07T20:23:56.2723862Z Transaction test succeeded. 2025-05-07T20:23:56.2728196Z Running transaction 2025-05-07T20:23:56.3766730Z Preparing : 1/1 2025-05-07T20:23:56.4277955Z Installing : lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1 2025-05-07T20:23:56.6154024Z Running scriptlet: lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1 2025-05-07T20:23:56.7549288Z ================================================================================ 2025-05-07T20:23:56.7549713Z WARNING: 2025-05-07T20:23:56.7549961Z A newer release of "Amazon Linux" is available. 
2025-05-07T20:23:56.7550532Z 2025-05-07T20:23:56.7550633Z Available Versions: 2025-05-07T20:23:56.7550801Z 2025-05-07T20:23:56.7550903Z Version 2023.7.20250331: 2025-05-07T20:23:56.7551268Z Run the following command to upgrade to 2023.7.20250331: 2025-05-07T20:23:56.7551539Z 2025-05-07T20:23:56.7551664Z dnf upgrade --releasever=2023.7.20250331 2025-05-07T20:23:56.7551890Z 2025-05-07T20:23:56.7551978Z Release notes: 2025-05-07T20:23:56.7552401Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html 2025-05-07T20:23:56.7552790Z 2025-05-07T20:23:56.7552878Z Version 2023.7.20250414: 2025-05-07T20:23:56.7553192Z Run the following command to upgrade to 2023.7.20250414: 2025-05-07T20:23:56.7553445Z 2025-05-07T20:23:56.7553569Z dnf upgrade --releasever=2023.7.20250414 2025-05-07T20:23:56.7553783Z 2025-05-07T20:23:56.7553874Z Release notes: 2025-05-07T20:23:56.7554268Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html 2025-05-07T20:23:56.7554660Z 2025-05-07T20:23:56.7554909Z Version 2023.7.20250428: 2025-05-07T20:23:56.7555229Z Run the following command to upgrade to 2023.7.20250428: 2025-05-07T20:23:56.7555483Z 2025-05-07T20:23:56.7555600Z dnf upgrade --releasever=2023.7.20250428 2025-05-07T20:23:56.7555823Z 2025-05-07T20:23:56.7555914Z Release notes: 2025-05-07T20:23:56.7556323Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html 2025-05-07T20:23:56.7556700Z 2025-05-07T20:23:56.7556819Z ================================================================================ 2025-05-07T20:23:56.7895790Z Verifying : lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1 2025-05-07T20:23:56.7896139Z 2025-05-07T20:23:56.7896230Z Installed: 2025-05-07T20:23:56.7896551Z lshw-B.02.19.2-7.amzn2023.0.3.x86_64 2025-05-07T20:23:56.7896851Z 2025-05-07T20:23:56.7896943Z Complete! 2025-05-07T20:23:56.8376743Z + hostname 2025-05-07T20:23:56.8376890Z 2025-05-07T20:23:56.8390354Z ip-10-0-35-243.ec2.internal 2025-05-07T20:23:56.8391356Z 2025-05-07T20:23:56.8391989Z + sudo lshw -C display 2025-05-07T20:23:56.8392138Z 2025-05-07T20:23:57.2437613Z *-display:0 UNCLAIMED 2025-05-07T20:23:57.2437932Z description: VGA compatible controller 2025-05-07T20:23:57.2438275Z product: Amazon.com, Inc. 2025-05-07T20:23:57.2438559Z vendor: Amazon.com, Inc. 
2025-05-07T20:23:57.2438827Z physical id: 3
2025-05-07T20:23:57.2439075Z bus info: pci@0000:00:03.0
2025-05-07T20:23:57.2439342Z version: 00
2025-05-07T20:23:57.2439565Z width: 32 bits
2025-05-07T20:23:57.2447923Z clock: 33MHz
2025-05-07T20:23:57.2448225Z capabilities: vga_controller bus_master
2025-05-07T20:23:57.2448567Z configuration: latency=0
2025-05-07T20:23:57.2448910Z resources: memory:c1000000-c13fffff memory:c0000-dffff
2025-05-07T20:23:57.2449254Z *-display:1
2025-05-07T20:23:57.2449518Z description: 3D controller
2025-05-07T20:23:57.2449823Z product: GA102GL [A10G]
2025-05-07T20:23:57.2450095Z vendor: NVIDIA Corporation
2025-05-07T20:23:57.2450374Z physical id: 1e
2025-05-07T20:23:57.2450621Z bus info: pci@0000:00:1e.0
2025-05-07T20:23:57.2450880Z version: a1
2025-05-07T20:23:57.2451102Z width: 64 bits
2025-05-07T20:23:57.2451332Z clock: 33MHz
2025-05-07T20:23:57.2451623Z capabilities: pm pciexpress msix bus_master cap_list
2025-05-07T20:23:57.2452013Z configuration: driver=nvidia latency=0
2025-05-07T20:23:57.2452665Z resources: iomemory:180-17f iomemory:100-ff irq:10 memory:c0000000-c0ffffff memory:1800000000-1fffffffff memory:1040000000-1041ffffff
2025-05-07T20:23:57.2475689Z
2025-05-07T20:23:57.2475954Z ################################################################################
2025-05-07T20:23:57.2476318Z [INFO] Printing NVIDIA GPU info ...
2025-05-07T20:23:57.2612517Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-05-07T20:23:57.2781176Z Wed May 7 20:23:57 2025
2025-05-07T20:23:57.2781595Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:57.2782127Z | NVIDIA-SMI 570.133.07 Driver Version: 570.133.07 CUDA Version: 12.8 |
2025-05-07T20:23:57.2782625Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:57.2783136Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:23:57.2783689Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
2025-05-07T20:23:57.2784136Z | | | MIG M. |
2025-05-07T20:23:57.2784490Z |=========================================+========================+======================|
2025-05-07T20:23:57.2864493Z | 0 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 |
2025-05-07T20:23:57.2865186Z | 0% 31C P0 60W / 300W | 0MiB / 23028MiB | 0% Default |
2025-05-07T20:23:57.2865582Z | | | N/A |
2025-05-07T20:23:57.2865990Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:57.2866401Z
2025-05-07T20:23:57.2866813Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:57.2867248Z | Processes: |
2025-05-07T20:23:57.2867709Z | GPU GI CI PID Type Process name GPU Memory |
2025-05-07T20:23:57.2868145Z | ID ID Usage |
2025-05-07T20:23:57.2868520Z |=========================================================================================|
2025-05-07T20:23:57.2869653Z | No running processes found |
2025-05-07T20:23:57.2870145Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:57.4307908Z ################################################################################
2025-05-07T20:23:57.4308262Z [INFO] Printing AMD GPU info ...
2025-05-07T20:23:57.4307908Z ################################################################################
2025-05-07T20:23:57.4308262Z [INFO] Printing AMD GPU info ...
2025-05-07T20:23:57.4450597Z which: no rocminfo in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin)
2025-05-07T20:23:57.4451323Z [CHECK] rocminfo not found
2025-05-07T20:23:57.4461190Z which: no rocm-smi in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin)
2025-05-07T20:23:57.4462186Z [CHECK] rocm-smi not found
2025-05-07T20:23:57.4499291Z ##[group]Run . $PRELUDE; setup_miniconda $HOME/miniconda
2025-05-07T20:23:57.4499749Z . $PRELUDE; setup_miniconda $HOME/miniconda
2025-05-07T20:23:57.4511643Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:23:57.4512066Z env:
2025-05-07T20:23:57.4512375Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:57.4512694Z BUILD_ENV: build_binary
2025-05-07T20:23:57.4512951Z BUILD_TARGET: genai
2025-05-07T20:23:57.4513184Z BUILD_VARIANT: cuda
2025-05-07T20:23:57.4513417Z BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:23:57.4513685Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:57.4513991Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:57.4514331Z ##[endgroup]
2025-05-07T20:23:57.7889432Z ################################################################################
2025-05-07T20:23:57.7889809Z # Setup Miniconda
2025-05-07T20:23:57.7890033Z #
2025-05-07T20:23:57.7906265Z # [2025-05-07T20:23:57.790Z] + setup_miniconda /home/ec2-user/miniconda
2025-05-07T20:23:57.7906676Z ################################################################################
2025-05-07T20:23:57.7921533Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:23:57.8810753Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:23:57.8811144Z + mkdir -p /home/ec2-user/miniconda
2025-05-07T20:23:57.8829573Z [SETUP] Downloading the Miniconda installer ...
2025-05-07T20:23:57.8852372Z [EXEC] [ATTEMPT 0/3] + wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
2025-05-07T20:23:58.6429154Z [SETUP] Installing Miniconda ...
2025-05-07T20:23:58.6429574Z + bash miniconda.sh -b -p /home/ec2-user/miniconda -u
2025-05-07T20:23:58.6575855Z PREFIX=/home/ec2-user/miniconda
2025-05-07T20:23:59.1090795Z Unpacking payload ...
2025-05-07T20:23:59.6259102Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior.
2025-05-07T20:24:00.4275005Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior.
2025-05-07T20:24:02.5521709Z Installing base environment...
2025-05-07T20:24:03.6317761Z Preparing transaction: ...working... done
2025-05-07T20:24:06.6632137Z Executing transaction: ...working... done
2025-05-07T20:24:07.3545178Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior.
2025-05-07T20:24:07.4468349Z installation finished.
2025-05-07T20:24:07.4475890Z + rm -f miniconda.sh
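The installer flags on the `bash miniconda.sh -b -p ... -u` line are Miniconda's batch mode (`-b`, no prompts), install prefix (`-p`), and update-in-place (`-u`). Each `[EXEC] [ATTEMPT n/3]` line comes from a retry wrapper in the prelude; a sketch of the pattern (the name exec_with_retries and the backoff are illustrative, the real helper lives in .github/scripts/setup_env.bash):

    # Sketch: run a command up to 3 times, echoing each attempt like the log above.
    exec_with_retries () {
      local max=3 attempt
      for attempt in $(seq 0 $((max - 1))); do
        echo "[EXEC] [ATTEMPT ${attempt}/${max}] + $*"
        "$@" && return 0
        sleep $((2 ** attempt))    # simple exponential backoff between attempts
      done
      return 1
    }
    exec_with_retries wget -q --timeout 1 pypi.org -O /dev/null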
2025-05-07T20:24:07.4784579Z [SETUP] Reloading the bash configuration ...
2025-05-07T20:24:07.4784952Z + /home/ec2-user/miniconda/bin/conda init bash
2025-05-07T20:24:07.8468699Z no change /home/ec2-user/miniconda/condabin/conda
2025-05-07T20:24:07.8469218Z no change /home/ec2-user/miniconda/bin/conda
2025-05-07T20:24:07.8469735Z no change /home/ec2-user/miniconda/bin/conda-env
2025-05-07T20:24:07.8470206Z no change /home/ec2-user/miniconda/bin/activate
2025-05-07T20:24:07.8470677Z no change /home/ec2-user/miniconda/bin/deactivate
2025-05-07T20:24:07.8471142Z no change /home/ec2-user/miniconda/etc/profile.d/conda.sh
2025-05-07T20:24:07.8471595Z no change /home/ec2-user/miniconda/etc/fish/conf.d/conda.fish
2025-05-07T20:24:07.8472048Z no change /home/ec2-user/miniconda/shell/condabin/Conda.psm1
2025-05-07T20:24:07.8472524Z no change /home/ec2-user/miniconda/shell/condabin/conda-hook.ps1
2025-05-07T20:24:07.8473604Z no change /home/ec2-user/miniconda/lib/python3.13/site-packages/xontrib/conda.xsh
2025-05-07T20:24:07.8474146Z no change /home/ec2-user/miniconda/etc/profile.d/conda.csh
2025-05-07T20:24:07.8474545Z modified /home/ec2-user/.bashrc
2025-05-07T20:24:07.8474952Z ==> For changes to take effect, close and re-open your current shell. <==
2025-05-07T20:24:07.9210637Z + . /home/ec2-user/.bashrc
2025-05-07T20:24:08.7739130Z [SETUP] Installing libmamba-solver (required since Anaconda 2024.02-1) and libarchive ...
2025-05-07T20:24:08.7762210Z [EXEC] [ATTEMPT 0/3] + conda install --solver=classic -c conda-forge --override-channels -y conda-libmamba-solver libmamba libmambapy libarchive
2025-05-07T20:24:22.2591405Z Collecting package metadata (current_repodata.json): done
2025-05-07T20:24:23.8994694Z Solving environment: done
2025-05-07T20:24:23.9955363Z ## Package Plan ##
2025-05-07T20:24:23.9955769Z environment location: /home/ec2-user/miniconda
2025-05-07T20:24:23.9956143Z added / updated specs:
2025-05-07T20:24:23.9956411Z - conda-libmamba-solver
2025-05-07T20:24:23.9956658Z - libarchive
2025-05-07T20:24:23.9956878Z - libmamba
2025-05-07T20:24:23.9957087Z - libmambapy
2025-05-07T20:24:23.9957357Z The following packages will be downloaded:
2025-05-07T20:24:23.9957687Z package                    | build
2025-05-07T20:24:23.9958015Z ---------------------------|-----------------
2025-05-07T20:24:23.9958438Z ca-certificates-2025.4.26  | hbd8a1cb_0      149 KB  conda-forge
2025-05-07T20:24:23.9958912Z certifi-2025.4.26          | pyhd8ed1ab_0    154 KB  conda-forge
2025-05-07T20:24:23.9959346Z conda-25.3.1               | py313h78bf25f_1 1.1 MB  conda-forge
2025-05-07T20:24:23.9959862Z conda-libmamba-solver-25.4.0| pyhd8ed1ab_0    41 KB  conda-forge
2025-05-07T20:24:23.9960424Z ------------------------------------------------------------
2025-05-07T20:24:23.9960764Z Total: 1.4 MB
2025-05-07T20:24:23.9961094Z The following packages will be UPDATED:
2025-05-07T20:24:23.9964983Z ca-certificates pkgs/main/linux-64::ca-certificates-2~ --> conda-forge/noarch::ca-certificates-2025.4.26-hbd8a1cb_0
2025-05-07T20:24:23.9965795Z conda pkgs/main::conda-25.3.1-py313h06a4308~ --> conda-forge::conda-25.3.1-py313h78bf25f_1
2025-05-07T20:24:23.9966556Z The following packages will be SUPERSEDED by a higher-priority channel:
2025-05-07T20:24:23.9967215Z certifi pkgs/main/linux-64::certifi-2025.4.26~ --> conda-forge/noarch::certifi-2025.4.26-pyhd8ed1ab_0
2025-05-07T20:24:23.9968042Z conda-libmamba-so~ pkgs/main::conda-libmamba-solver-25.4~ --> conda-forge::conda-libmamba-solver-25.4.0-pyhd8ed1ab_0
2025-05-07T20:24:23.9968704Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:24.1090795Z conda-libmamba-solve | 41 KB | ########## | 100%
2025-05-07T20:24:24.1311744Z ca-certificates-2025 | 149 KB | ########## | 100%
2025-05-07T20:24:24.2020893Z certifi-2025.4.26 | 154 KB | ########## | 100%
2025-05-07T20:24:24.2026774Z conda-25.3.1 | 1.1 MB | ########## | 100%
2025-05-07T20:24:24.2028841Z done
2025-05-07T20:24:24.3033970Z Preparing transaction: done
2025-05-07T20:24:24.4039961Z Verifying transaction: done
2025-05-07T20:24:25.8061488Z Executing transaction: done
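The solver packages are installed with `--solver=classic`, presumably so the transaction does not depend on the very solver it is replacing, and `--override-channels` keeps pkgs/main out of the resolution. A quick way to confirm which solver ends up active (a sketch; `conda config --show solver` assumes a conda new enough to have the solver setting, as this one is):

    # Sketch: verify which dependency solver conda will use from now on.
    conda config --show solver     # expected: solver: libmamba
    conda info | grep -i solver    # the conda info dump below shows "solver : libmamba (default)"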
2025-05-07T20:24:27.6603394Z [SETUP] Updating Miniconda base packages ...
2025-05-07T20:24:27.6630112Z [EXEC] [ATTEMPT 0/3] + conda update -n base -c defaults --update-deps -y conda
2025-05-07T20:24:28.6032156Z Channels:
2025-05-07T20:24:28.6032407Z - defaults
2025-05-07T20:24:28.6032627Z Platform: linux-64
2025-05-07T20:24:29.8626682Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:30.4804144Z Solving environment: done
2025-05-07T20:24:30.6287939Z ## Package Plan ##
2025-05-07T20:24:30.6288466Z environment location: /home/ec2-user/miniconda
2025-05-07T20:24:30.6289098Z added / updated specs:
2025-05-07T20:24:30.6289543Z - conda
2025-05-07T20:24:30.6289979Z The following packages will be downloaded:
2025-05-07T20:24:30.6290542Z package                    | build
2025-05-07T20:24:30.6291160Z ---------------------------|-----------------
2025-05-07T20:24:30.6291525Z pip-25.1                   | pyhc872135_2    1.3 MB
2025-05-07T20:24:30.6291913Z tzdata-2025b               | h04d1e81_0      116 KB
2025-05-07T20:24:30.6292324Z ------------------------------------------------------------
2025-05-07T20:24:30.6292673Z Total: 1.4 MB
2025-05-07T20:24:30.6293012Z The following packages will be UPDATED:
2025-05-07T20:24:30.6293528Z pip pkgs/main/linux-64::pip-25.0-py313h06~ --> pkgs/main/noarch::pip-25.1-pyhc872135_2
2025-05-07T20:24:30.6294057Z tzdata 2025a-h04d1e81_0 --> 2025b-h04d1e81_0
2025-05-07T20:24:30.6294487Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:30.9275245Z tzdata-2025b | 116 KB | ########## | 100%
2025-05-07T20:24:30.9280091Z pip-25.1 | 1.3 MB | ########## | 100%
2025-05-07T20:24:30.9281303Z done
2025-05-07T20:24:31.0286921Z Preparing transaction: done
2025-05-07T20:24:31.1294151Z Verifying transaction: done
2025-05-07T20:24:33.2320850Z Executing transaction: done
2025-05-07T20:24:33.8550662Z [SETUP] Cleaning up Conda packages ...
2025-05-07T20:24:33.8554712Z + conda clean --packages --tarball -y
2025-05-07T20:24:34.8952057Z Will remove 99 (117.8 MB) tarball(s).
2025-05-07T20:24:34.8952449Z Will remove 11 (16.0 MB) package(s).
2025-05-07T20:24:34.9638684Z + conda clean --all -y
2025-05-07T20:24:35.5061360Z There are no unused tarball(s) to remove.
2025-05-07T20:24:35.5061822Z Will remove 1 index cache(s).
2025-05-07T20:24:35.5062252Z There are no unused package(s) to remove.
2025-05-07T20:24:35.5062579Z There are no tempfile(s) to remove.
2025-05-07T20:24:35.5062874Z There are no logfile(s) to remove.
2025-05-07T20:24:35.5743726Z + conda info
2025-05-07T20:24:36.3512512Z active environment : base
2025-05-07T20:24:36.3513289Z active env location : /home/ec2-user/miniconda
2025-05-07T20:24:36.3513945Z shell level : 1
2025-05-07T20:24:36.3514532Z user config file : /home/ec2-user/.condarc
2025-05-07T20:24:36.3515316Z populated config files : /home/ec2-user/miniconda/.condarc
2025-05-07T20:24:36.3516041Z conda version : 25.3.1
2025-05-07T20:24:36.3516600Z conda-build version : not installed
2025-05-07T20:24:36.3517198Z python version : 3.13.2.final.0
2025-05-07T20:24:36.3517789Z solver : libmamba (default)
2025-05-07T20:24:36.3518404Z virtual packages : __archspec=1=zen2
2025-05-07T20:24:36.3518999Z __conda=25.3.1=0
2025-05-07T20:24:36.3519558Z __cuda=12.8=0
2025-05-07T20:24:36.3520104Z __glibc=2.34=0
2025-05-07T20:24:36.3520661Z __linux=6.1.130=0
2025-05-07T20:24:36.3521711Z __unix=0=0
2025-05-07T20:24:36.3522215Z base environment : /home/ec2-user/miniconda (writable)
2025-05-07T20:24:36.3522654Z conda av data dir : /home/ec2-user/miniconda/etc/conda
2025-05-07T20:24:36.3523011Z conda av metadata url : None
2025-05-07T20:24:36.3523378Z channel URLs : https://repo.anaconda.com/pkgs/main/linux-64
2025-05-07T20:24:36.3523821Z https://repo.anaconda.com/pkgs/main/noarch
2025-05-07T20:24:36.3524213Z https://repo.anaconda.com/pkgs/r/linux-64
2025-05-07T20:24:36.3524599Z https://repo.anaconda.com/pkgs/r/noarch
2025-05-07T20:24:36.3524966Z package cache : /home/ec2-user/miniconda/pkgs
2025-05-07T20:24:36.3525309Z /home/ec2-user/.conda/pkgs
2025-05-07T20:24:36.3525919Z envs directories : /home/ec2-user/miniconda/envs
2025-05-07T20:24:36.3526264Z /home/ec2-user/.conda/envs
2025-05-07T20:24:36.3526584Z platform : linux-64
2025-05-07T20:24:36.3527438Z user-agent : conda/25.3.1 requests/2.32.3 CPython/3.13.2 Linux/6.1.130-139.222.amzn2023.x86_64 amzn/2023.6.20250317 glibc/2.34 solver/libmamba conda-libmamba-solver/25.4.0 libmambapy/2.0.5 aau/0.7.0 c/. s/. e/.
2025-05-07T20:24:36.3528486Z UID:GID : 1000:1000
2025-05-07T20:24:36.3528764Z netrc file : None
2025-05-07T20:24:36.3529035Z offline mode : False
2025-05-07T20:24:36.4212274Z [SETUP] Exporting Miniconda variables ...
2025-05-07T20:24:36.4213033Z [SETUP] Saving Miniconda variables to /home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_4041281f-d965-4d6b-b629-f6367ffc8ef1 ...
2025-05-07T20:24:36.4213843Z [SETUP] Successfully set up Miniconda at /home/ec2-user/miniconda
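The add_path_* runner file command used here is GitHub Actions' mechanism for persisting PATH entries across steps. From inside a step the same effect is achieved by appending to $GITHUB_PATH (a real Actions interface; the directory below is this job's Miniconda prefix):

    # Sketch: make the Miniconda binaries visible to all subsequent job steps.
    echo "${HOME}/miniconda/bin" >> "${GITHUB_PATH}"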
2025-05-07T20:24:36.4303028Z ##[group]Run . $PRELUDE; create_conda_environment $BUILD_ENV 3.12
2025-05-07T20:24:36.4303526Z . $PRELUDE; create_conda_environment $BUILD_ENV 3.12
2025-05-07T20:24:36.4323152Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:24:36.4323517Z env:
2025-05-07T20:24:36.4323772Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:24:36.4324080Z BUILD_ENV: build_binary
2025-05-07T20:24:36.4324339Z BUILD_TARGET: genai
2025-05-07T20:24:36.4324577Z BUILD_VARIANT: cuda
2025-05-07T20:24:36.4324810Z BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:24:36.4325075Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:24:36.4325715Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:24:36.4326129Z ##[endgroup]
2025-05-07T20:24:36.7755616Z ################################################################################
2025-05-07T20:24:36.7756009Z # Create Conda Environment
2025-05-07T20:24:36.7756264Z #
2025-05-07T20:24:36.7771710Z # [2025-05-07T20:24:36.776Z] + create_conda_environment build_binary 3.12
2025-05-07T20:24:36.7772189Z ################################################################################
2025-05-07T20:24:36.7788694Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:24:37.0230155Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:24:37.0230555Z [SETUP] Listing existing Conda environments ...
2025-05-07T20:24:37.0230902Z + conda info --envs
2025-05-07T20:24:37.7782925Z # conda environments:
2025-05-07T20:24:37.7783213Z #
2025-05-07T20:24:37.7783447Z base /home/ec2-user/miniconda
2025-05-07T20:24:37.8489640Z [SETUP] Deleting the prefix directory if it exists ...
2025-05-07T20:24:39.4956041Z + rm -rf /home/ec2-user/miniconda/envs/build_binary
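Deleting the prefix with `rm -rf` before `conda create` makes environment creation idempotent on a reused self-hosted runner. A compact sketch of the same recreate-from-scratch pattern (env name and Python version taken from this job's settings):

    # Sketch: recreate the build environment from scratch on every run.
    rm -rf "${HOME}/miniconda/envs/build_binary"   # tolerates a missing env
    conda create -y -n build_binary python=3.12

(`conda remove -n build_binary --all -y` would be the conda-native equivalent of the deletion.)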
2025-05-07T20:24:39.4982730Z [SETUP] Creating new Conda environment (Python 3.12) ...
2025-05-07T20:24:39.5003755Z [EXEC] [ATTEMPT 0/3] + conda create -y -n build_binary python=3.12
2025-05-07T20:24:40.2586772Z Channels:
2025-05-07T20:24:40.2587019Z - defaults
2025-05-07T20:24:40.2587231Z Platform: linux-64
2025-05-07T20:24:41.8329746Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:41.9334818Z Solving environment: done
2025-05-07T20:24:41.9625821Z ## Package Plan ##
2025-05-07T20:24:41.9626393Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:41.9626959Z added / updated specs:
2025-05-07T20:24:41.9627240Z - python=3.12
2025-05-07T20:24:41.9627496Z The following packages will be downloaded:
2025-05-07T20:24:41.9627868Z package                    | build
2025-05-07T20:24:41.9628194Z ---------------------------|-----------------
2025-05-07T20:24:41.9628563Z _libgcc_mutex-0.1          | main            3 KB
2025-05-07T20:24:41.9628965Z _openmp_mutex-5.1          | 1_gnu           21 KB
2025-05-07T20:24:41.9629957Z ca-certificates-2025.2.25  | h06a4308_0      129 KB
2025-05-07T20:24:41.9630532Z python-3.12.9              | h5148396_0      34.7 MB
2025-05-07T20:24:41.9631052Z setuptools-78.1.1          | py312h06a4308_0 2.2 MB
2025-05-07T20:24:41.9631460Z wheel-0.45.1               | py312h06a4308_0 147 KB
2025-05-07T20:24:41.9631834Z ------------------------------------------------------------
2025-05-07T20:24:41.9632169Z Total: 37.2 MB
2025-05-07T20:24:41.9632518Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:41.9633196Z _libgcc_mutex pkgs/main/linux-64::_libgcc_mutex-0.1-main
2025-05-07T20:24:41.9633651Z _openmp_mutex pkgs/main/linux-64::_openmp_mutex-5.1-1_gnu
2025-05-07T20:24:41.9634068Z bzip2 pkgs/main/linux-64::bzip2-1.0.8-h5eee18b_6
2025-05-07T20:24:41.9634554Z ca-certificates pkgs/main/linux-64::ca-certificates-2025.2.25-h06a4308_0
2025-05-07T20:24:41.9635055Z expat pkgs/main/linux-64::expat-2.7.1-h6a678d5_0
2025-05-07T20:24:41.9635509Z ld_impl_linux-64 pkgs/main/linux-64::ld_impl_linux-64-2.40-h12ee557_0
2025-05-07T20:24:41.9635966Z libffi pkgs/main/linux-64::libffi-3.4.4-h6a678d5_1
2025-05-07T20:24:41.9636410Z libgcc-ng pkgs/main/linux-64::libgcc-ng-11.2.0-h1234567_1
2025-05-07T20:24:41.9636926Z libgomp pkgs/main/linux-64::libgomp-11.2.0-h1234567_1
2025-05-07T20:24:41.9637591Z libstdcxx-ng pkgs/main/linux-64::libstdcxx-ng-11.2.0-h1234567_1
2025-05-07T20:24:41.9638233Z libuuid pkgs/main/linux-64::libuuid-1.41.5-h5eee18b_0
2025-05-07T20:24:41.9638665Z ncurses pkgs/main/linux-64::ncurses-6.4-h6a678d5_0
2025-05-07T20:24:41.9639091Z openssl pkgs/main/linux-64::openssl-3.0.16-h5eee18b_0
2025-05-07T20:24:41.9639493Z pip pkgs/main/noarch::pip-25.1-pyhc872135_2
2025-05-07T20:24:41.9639901Z python pkgs/main/linux-64::python-3.12.9-h5148396_0
2025-05-07T20:24:41.9640333Z readline pkgs/main/linux-64::readline-8.2-h5eee18b_0
2025-05-07T20:24:41.9640815Z setuptools pkgs/main/linux-64::setuptools-78.1.1-py312h06a4308_0
2025-05-07T20:24:41.9641281Z sqlite pkgs/main/linux-64::sqlite-3.45.3-h5eee18b_0
2025-05-07T20:24:41.9641670Z tk pkgs/main/linux-64::tk-8.6.14-h39e8969_0
2025-05-07T20:24:41.9642052Z tzdata pkgs/main/noarch::tzdata-2025b-h04d1e81_0
2025-05-07T20:24:41.9642472Z wheel pkgs/main/linux-64::wheel-0.45.1-py312h06a4308_0
2025-05-07T20:24:41.9642868Z xz pkgs/main/linux-64::xz-5.6.4-h5eee18b_1
2025-05-07T20:24:41.9643239Z zlib pkgs/main/linux-64::zlib-1.2.13-h5eee18b_1
2025-05-07T20:24:41.9643637Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:42.0970775Z ca-certificates-2025 | 129 KB | ########## | 100%
2025-05-07T20:24:42.1299249Z _libgcc_mutex-0.1 | 3 KB | ########## | 100%
2025-05-07T20:24:42.1327539Z _openmp_mutex-5.1 | 21 KB | ########## | 100%
2025-05-07T20:24:42.1690864Z wheel-0.45.1 | 147 KB | ########## | 100%
2025-05-07T20:24:42.4755425Z setuptools-78.1.1 | 2.2 MB | ########## | 100%
2025-05-07T20:24:43.1003065Z python-3.12.9 | 34.7 MB | ########## | 100%
2025-05-07T20:24:43.1006087Z done
2025-05-07T20:24:43.3112463Z Preparing transaction: done
2025-05-07T20:24:44.7395979Z Verifying transaction: done
2025-05-07T20:24:47.1577547Z Executing transaction: done
2025-05-07T20:24:47.2084864Z #
2025-05-07T20:24:47.2085530Z # To activate this environment, use
2025-05-07T20:24:47.2085842Z #
2025-05-07T20:24:47.2086045Z # $ conda activate build_binary
2025-05-07T20:24:47.2086317Z #
2025-05-07T20:24:47.2086536Z # To deactivate an active environment, use
2025-05-07T20:24:47.2086834Z #
2025-05-07T20:24:47.2087026Z # $ conda deactivate
2025-05-07T20:24:47.3252597Z [SETUP] Upgrading PIP to latest ...
2025-05-07T20:24:47.3277184Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --upgrade pip
2025-05-07T20:24:50.3485173Z Requirement already satisfied: pip in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (25.1)
2025-05-07T20:24:50.3485847Z Collecting pip
2025-05-07T20:24:50.3486166Z Downloading pip-25.1.1-py3-none-any.whl.metadata (3.6 kB)
2025-05-07T20:24:50.3486599Z Downloading pip-25.1.1-py3-none-any.whl (1.8 MB)
2025-05-07T20:24:50.3489509Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.8/1.8 MB 90.7 MB/s eta 0:00:00
2025-05-07T20:24:50.3489885Z Installing collected packages: pip
2025-05-07T20:24:50.3490239Z Attempting uninstall: pip
2025-05-07T20:24:50.3490533Z Found existing installation: pip 25.1
2025-05-07T20:24:50.3490843Z Uninstalling pip-25.1:
2025-05-07T20:24:50.3491133Z Successfully uninstalled pip-25.1
2025-05-07T20:24:50.3491447Z Successfully installed pip-25.1.1
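`conda run -n build_binary` executes a command inside the named environment without activating it in the calling shell, which is why pip here upgrades the env's own pip rather than the base one. An equivalent invocation that leaves no ambiguity about which interpreter's pip runs (a sketch; `python -m pip` is the standard guard against stale pip entry points):

    # Sketch: upgrade pip inside the target env, pinning the interpreter explicitly.
    conda run -n build_binary python -m pip install --upgrade pip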
2025-05-07T20:24:50.4173026Z [SETUP] Upgrading pyOpenSSL ...
2025-05-07T20:24:50.4197177Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y pyOpenSSL>22.1.0
2025-05-07T20:24:51.2771802Z Channels:
2025-05-07T20:24:51.2772045Z - conda-forge
2025-05-07T20:24:51.2772280Z Platform: linux-64
2025-05-07T20:25:01.8836377Z Collecting package metadata (repodata.json): done
2025-05-07T20:25:03.6035726Z Solving environment: done
2025-05-07T20:25:03.6656548Z ## Package Plan ##
2025-05-07T20:25:03.6657049Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:25:03.6657605Z added / updated specs:
2025-05-07T20:25:03.6657982Z - pyopenssl[version='>22.1.0']
2025-05-07T20:25:03.6658327Z The following packages will be downloaded:
2025-05-07T20:25:03.6658672Z package                    | build
2025-05-07T20:25:03.6659054Z ---------------------------|-----------------
2025-05-07T20:25:03.6659597Z cffi-1.17.1                | py312h06ac9bb_0 288 KB  conda-forge
2025-05-07T20:25:03.6660293Z cryptography-44.0.3        | py312hda17c39_0 1.5 MB  conda-forge
2025-05-07T20:25:03.6660827Z expat-2.7.0                | h5888daf_0      137 KB  conda-forge
2025-05-07T20:25:03.6661238Z libexpat-2.7.0             | h5888daf_0      73 KB   conda-forge
2025-05-07T20:25:03.6661674Z libgcc-15.1.0              | h767d61c_2      810 KB  conda-forge
2025-05-07T20:25:03.6662097Z libgcc-ng-15.1.0           | h69a702a_2      34 KB   conda-forge
2025-05-07T20:25:03.6662519Z libgomp-15.1.0             | h767d61c_2      442 KB  conda-forge
2025-05-07T20:25:03.6662942Z libnsl-2.0.1               | hd590300_0      33 KB   conda-forge
2025-05-07T20:25:03.6663692Z libsqlite-3.46.0           | hde9e2c9_0      845 KB  conda-forge
2025-05-07T20:25:03.6664126Z libuuid-2.38.1             | h0b41bf4_0      33 KB   conda-forge
2025-05-07T20:25:03.6664551Z libxcrypt-4.4.36           | hd590300_1      98 KB   conda-forge
2025-05-07T20:25:03.6664984Z libzlib-1.2.13             | h4ab18f5_6      60 KB   conda-forge
2025-05-07T20:25:03.6665406Z openssl-3.5.0              | h7b32b05_1      3.0 MB  conda-forge
2025-05-07T20:25:03.6665980Z pycparser-2.22             | pyh29332c3_1    108 KB  conda-forge
2025-05-07T20:25:03.6666670Z pyopenssl-25.0.0           | pyhd8ed1ab_0    120 KB  conda-forge
2025-05-07T20:25:03.6667131Z python-3.12.2              |hab00c5b_0_cpython 30.8 MB conda-forge
2025-05-07T20:25:03.6667585Z python_abi-3.12            | 7_cp312         7 KB    conda-forge
2025-05-07T20:25:03.6668053Z typing-extensions-4.13.2   | h0e9735f_0      88 KB   conda-forge
2025-05-07T20:25:03.6668553Z typing_extensions-4.13.2   | pyh29332c3_0    51 KB   conda-forge
2025-05-07T20:25:03.6669003Z zlib-1.2.13                | h4ab18f5_6      91 KB   conda-forge
2025-05-07T20:25:03.6669396Z ------------------------------------------------------------
2025-05-07T20:25:03.6669743Z Total: 38.6 MB
2025-05-07T20:25:03.6670096Z The following NEW packages will be INSTALLED:
2025-05-07T20:25:03.6670531Z cffi conda-forge/linux-64::cffi-1.17.1-py312h06ac9bb_0
2025-05-07T20:25:03.6671049Z cryptography conda-forge/linux-64::cryptography-44.0.3-py312hda17c39_0
2025-05-07T20:25:03.6671572Z libexpat conda-forge/linux-64::libexpat-2.7.0-h5888daf_0
2025-05-07T20:25:03.6672024Z libgcc conda-forge/linux-64::libgcc-15.1.0-h767d61c_2
2025-05-07T20:25:03.6672465Z libnsl conda-forge/linux-64::libnsl-2.0.1-hd590300_0
2025-05-07T20:25:03.6675566Z libsqlite conda-forge/linux-64::libsqlite-3.46.0-hde9e2c9_0
2025-05-07T20:25:03.6676115Z libxcrypt conda-forge/linux-64::libxcrypt-4.4.36-hd590300_1
2025-05-07T20:25:03.6676796Z libzlib conda-forge/linux-64::libzlib-1.2.13-h4ab18f5_6
2025-05-07T20:25:03.6677312Z pycparser conda-forge/noarch::pycparser-2.22-pyh29332c3_1
2025-05-07T20:25:03.6677798Z pyopenssl conda-forge/noarch::pyopenssl-25.0.0-pyhd8ed1ab_0
2025-05-07T20:25:03.6678292Z python_abi conda-forge/noarch::python_abi-3.12-7_cp312
2025-05-07T20:25:03.6678821Z typing-extensions conda-forge/noarch::typing-extensions-4.13.2-h0e9735f_0
2025-05-07T20:25:03.6679428Z typing_extensions conda-forge/noarch::typing_extensions-4.13.2-pyh29332c3_0
2025-05-07T20:25:03.6679923Z The following packages will be UPDATED:
2025-05-07T20:25:03.6680653Z ca-certificates pkgs/main/linux-64::ca-certificates-2~ --> conda-forge/noarch::ca-certificates-2025.4.26-hbd8a1cb_0
2025-05-07T20:25:03.6681453Z libgcc-ng pkgs/main::libgcc-ng-11.2.0-h1234567_1 --> conda-forge::libgcc-ng-15.1.0-h69a702a_2
2025-05-07T20:25:03.6682134Z libgomp pkgs/main::libgomp-11.2.0-h1234567_1 --> conda-forge::libgomp-15.1.0-h767d61c_2
2025-05-07T20:25:03.6682797Z libuuid pkgs/main::libuuid-1.41.5-h5eee18b_0 --> conda-forge::libuuid-2.38.1-h0b41bf4_0
2025-05-07T20:25:03.6683455Z openssl pkgs/main::openssl-3.0.16-h5eee18b_0 --> conda-forge::openssl-3.5.0-h7b32b05_1
2025-05-07T20:25:03.6684174Z zlib pkgs/main::zlib-1.2.13-h5eee18b_1 --> conda-forge::zlib-1.2.13-h4ab18f5_6
2025-05-07T20:25:03.6684765Z The following packages will be SUPERSEDED by a higher-priority channel:
2025-05-07T20:25:03.6685474Z expat pkgs/main::expat-2.7.1-h6a678d5_0 --> conda-forge::expat-2.7.0-h5888daf_0
2025-05-07T20:25:03.6686115Z python pkgs/main::python-3.12.9-h5148396_0 --> conda-forge::python-3.12.2-hab00c5b_0_cpython
2025-05-07T20:25:03.6686674Z Downloading and Extracting Packages: ...working...
2025-05-07T20:25:03.9589522Z libsqlite-3.46.0 | 845 KB | ########## | 100%
2025-05-07T20:25:04.0326674Z libgcc-ng-15.1.0 | 34 KB | ########## | 100%
2025-05-07T20:25:04.0591222Z libgcc-15.1.0 | 810 KB | ########## | 100%
2025-05-07T20:25:04.0653099Z libuuid-2.38.1 | 33 KB | ########## | 100%
2025-05-07T20:25:04.0788507Z libnsl-2.0.1 | 33 KB | ########## | 100%
2025-05-07T20:25:04.1677432Z libgomp-15.1.0 | 442 KB | ########## | 100%
2025-05-07T20:25:04.1708743Z expat-2.7.0 | 137 KB | ########## | 100%
2025-05-07T20:25:04.2443670Z cffi-1.17.1 | 288 KB | ########## | 100%
2025-05-07T20:25:04.2676934Z pyopenssl-25.0.0 | 120 KB | ########## | 100%
2025-05-07T20:25:04.4975784Z python-3.12.2 | 30.8 MB | ########5 | 86%
2025-05-07T20:25:04.5203585Z pycparser-2.22 | 108 KB | ########## | 100%
2025-05-07T20:25:04.5441596Z cryptography-44.0.3 | 1.5 MB | ########## | 100%
2025-05-07T20:25:04.5645077Z zlib-1.2.13 | 91 KB | ########## | 100%
2025-05-07T20:25:04.5785411Z libxcrypt-4.4.36 | 98 KB | ########## | 100%
2025-05-07T20:25:04.5861669Z typing-extensions-4. | 88 KB | ########## | 100%
2025-05-07T20:25:04.5967408Z openssl-3.5.0 | 3.0 MB | ########## | 100%
2025-05-07T20:25:04.6149896Z libexpat-2.7.0 | 73 KB | ########## | 100%
2025-05-07T20:25:04.6318090Z libzlib-1.2.13 | 60 KB | ########## | 100%
2025-05-07T20:25:04.6327584Z typing_extensions-4. | 51 KB | ########## | 100%
| 51 KB | ########## | 100%  2025-05-07T20:25:04.6564426Z 2025-05-07T20:25:04.6564430Z 2025-05-07T20:25:04.6564434Z 2025-05-07T20:25:04.6564438Z 2025-05-07T20:25:04.6564441Z 2025-05-07T20:25:04.6564445Z 2025-05-07T20:25:04.6564448Z 2025-05-07T20:25:04.6564457Z 2025-05-07T20:25:04.6564460Z 2025-05-07T20:25:04.6564464Z 2025-05-07T20:25:04.6564468Z 2025-05-07T20:25:04.6564471Z 2025-05-07T20:25:04.6564475Z 2025-05-07T20:25:04.6564486Z 2025-05-07T20:25:04.6564490Z 2025-05-07T20:25:04.6564493Z 2025-05-07T20:25:04.6564497Z 2025-05-07T20:25:04.6574098Z libuuid-2.38.1 | 33 KB | ########## | 100%  2025-05-07T20:25:04.6574516Z 2025-05-07T20:25:04.6574519Z 2025-05-07T20:25:04.6574523Z 2025-05-07T20:25:04.6574526Z 2025-05-07T20:25:04.6574530Z 2025-05-07T20:25:04.6574533Z 2025-05-07T20:25:04.6574537Z 2025-05-07T20:25:04.6574540Z 2025-05-07T20:25:04.6574543Z 2025-05-07T20:25:04.6574554Z 2025-05-07T20:25:04.6574558Z 2025-05-07T20:25:04.6574561Z 2025-05-07T20:25:04.6574564Z 2025-05-07T20:25:04.6574568Z 2025-05-07T20:25:04.6574571Z 2025-05-07T20:25:04.6574575Z 2025-05-07T20:25:04.6574578Z 2025-05-07T20:25:04.6679722Z libuuid-2.38.1 | 33 KB | ########## | 100%  2025-05-07T20:25:04.6680192Z 2025-05-07T20:25:04.6680196Z 2025-05-07T20:25:04.6680199Z 2025-05-07T20:25:04.6680203Z 2025-05-07T20:25:04.6680206Z 2025-05-07T20:25:04.6680210Z 2025-05-07T20:25:04.6680213Z 2025-05-07T20:25:04.6680216Z 2025-05-07T20:25:04.6680220Z 2025-05-07T20:25:04.6680233Z 2025-05-07T20:25:04.6680237Z 2025-05-07T20:25:04.6680240Z 2025-05-07T20:25:04.6680244Z 2025-05-07T20:25:04.6680247Z 2025-05-07T20:25:04.6680251Z 2025-05-07T20:25:04.6682322Z 2025-05-07T20:25:04.6687870Z libgcc-ng-15.1.0 | 34 KB | ########## | 100%  2025-05-07T20:25:04.6688324Z 2025-05-07T20:25:04.6688336Z 2025-05-07T20:25:04.6688340Z 2025-05-07T20:25:04.6688343Z 2025-05-07T20:25:04.6688347Z 2025-05-07T20:25:04.6688350Z 2025-05-07T20:25:04.6688353Z 2025-05-07T20:25:04.6688357Z 2025-05-07T20:25:04.6688360Z 2025-05-07T20:25:04.6688364Z 2025-05-07T20:25:04.6688367Z 2025-05-07T20:25:04.6688370Z 2025-05-07T20:25:04.6688545Z 2025-05-07T20:25:04.6688548Z 2025-05-07T20:25:04.6688551Z 2025-05-07T20:25:04.6688555Z 2025-05-07T20:25:04.6688558Z 2025-05-07T20:25:04.6688562Z 2025-05-07T20:25:04.6691677Z libnsl-2.0.1 | 33 KB | ########## | 100%  2025-05-07T20:25:04.6692113Z 2025-05-07T20:25:04.6692118Z 2025-05-07T20:25:04.6692123Z 2025-05-07T20:25:04.6692128Z 2025-05-07T20:25:04.6692132Z 2025-05-07T20:25:04.6692148Z 2025-05-07T20:25:04.6692152Z 2025-05-07T20:25:04.6692155Z 2025-05-07T20:25:04.6692159Z 2025-05-07T20:25:04.6692162Z 2025-05-07T20:25:04.6692166Z 2025-05-07T20:25:04.6692169Z 2025-05-07T20:25:04.6692173Z 2025-05-07T20:25:04.6692339Z 2025-05-07T20:25:04.6692343Z 2025-05-07T20:25:04.6692374Z 2025-05-07T20:25:04.6695581Z libgcc-ng-15.1.0 | 34 KB | ########## | 100%  2025-05-07T20:25:04.6696010Z 2025-05-07T20:25:04.6696014Z 2025-05-07T20:25:04.6696017Z 2025-05-07T20:25:04.6696029Z 2025-05-07T20:25:04.6696033Z 2025-05-07T20:25:04.6696036Z 2025-05-07T20:25:04.6696040Z 2025-05-07T20:25:04.6696043Z 2025-05-07T20:25:04.6696047Z 2025-05-07T20:25:04.6696050Z 2025-05-07T20:25:04.6696054Z 2025-05-07T20:25:04.6696057Z 2025-05-07T20:25:04.6696061Z 2025-05-07T20:25:04.6696065Z 2025-05-07T20:25:04.6696068Z 2025-05-07T20:25:04.6696082Z 2025-05-07T20:25:04.6696085Z 2025-05-07T20:25:04.6696089Z 2025-05-07T20:25:04.6740057Z libnsl-2.0.1 | 33 KB | ########## | 100%  2025-05-07T20:25:04.6740468Z 2025-05-07T20:25:04.6740474Z 2025-05-07T20:25:04.6740479Z 2025-05-07T20:25:04.6740484Z 
2025-05-07T20:25:04.6740500Z 2025-05-07T20:25:04.6740506Z 2025-05-07T20:25:04.6740511Z 2025-05-07T20:25:04.6740517Z 2025-05-07T20:25:04.6740522Z 2025-05-07T20:25:04.6740527Z 2025-05-07T20:25:04.6740532Z 2025-05-07T20:25:04.6740537Z 2025-05-07T20:25:04.6740542Z 2025-05-07T20:25:04.6740547Z 2025-05-07T20:25:04.6740559Z 2025-05-07T20:25:04.6740564Z 2025-05-07T20:25:04.6740570Z 2025-05-07T20:25:04.6740575Z 2025-05-07T20:25:04.6740580Z 2025-05-07T20:25:04.6873266Z ... (more hidden) ... 2025-05-07T20:25:05.3777071Z python-3.12.2 | 30.8 MB | ########## | 100% 2025-05-07T20:25:05.3785129Z python-3.12.2 | 30.8 MB | ########## | 100% 2025-05-07T20:25:05.3785637Z 2025-05-07T20:25:05.3785646Z 2025-05-07T20:25:05.3785653Z 2025-05-07T20:25:05.3785661Z 2025-05-07T20:25:05.3785668Z 2025-05-07T20:25:05.3785675Z 2025-05-07T20:25:05.3785682Z 2025-05-07T20:25:05.3785689Z 2025-05-07T20:25:05.3785714Z 2025-05-07T20:25:05.3785721Z 2025-05-07T20:25:05.3785764Z 2025-05-07T20:25:05.3785772Z 2025-05-07T20:25:05.3785779Z 2025-05-07T20:25:05.3785786Z 2025-05-07T20:25:05.3785794Z 2025-05-07T20:25:05.3785802Z 2025-05-07T20:25:05.3785809Z 2025-05-07T20:25:05.3785816Z 2025-05-07T20:25:05.3785823Z 2025-05-07T20:25:05.3785992Z 2025-05-07T20:25:05.3786867Z  2025-05-07T20:25:05.3787528Z 2025-05-07T20:25:05.3787933Z 2025-05-07T20:25:05.3788263Z  2025-05-07T20:25:05.3788692Z 2025-05-07T20:25:05.3788700Z 2025-05-07T20:25:05.3789028Z  2025-05-07T20:25:05.3789346Z 2025-05-07T20:25:05.3789350Z 2025-05-07T20:25:05.3789354Z 2025-05-07T20:25:05.3789533Z  2025-05-07T20:25:05.3789750Z 2025-05-07T20:25:05.3789763Z 2025-05-07T20:25:05.3789767Z 2025-05-07T20:25:05.3789777Z 2025-05-07T20:25:05.3789956Z  2025-05-07T20:25:05.3790174Z 2025-05-07T20:25:05.3790178Z 2025-05-07T20:25:05.3790181Z 2025-05-07T20:25:05.3790184Z 2025-05-07T20:25:05.3790200Z 2025-05-07T20:25:05.3790378Z  2025-05-07T20:25:05.3790851Z 2025-05-07T20:25:05.3790898Z 2025-05-07T20:25:05.3790902Z 2025-05-07T20:25:05.3790906Z 2025-05-07T20:25:05.3790909Z 2025-05-07T20:25:05.3790913Z 2025-05-07T20:25:05.3791104Z  2025-05-07T20:25:05.3791340Z 2025-05-07T20:25:05.3791344Z 2025-05-07T20:25:05.3791347Z 2025-05-07T20:25:05.3791351Z 2025-05-07T20:25:05.3791354Z 2025-05-07T20:25:05.3791357Z 2025-05-07T20:25:05.3791361Z 2025-05-07T20:25:05.3791546Z  2025-05-07T20:25:05.3791787Z 2025-05-07T20:25:05.3791790Z 2025-05-07T20:25:05.3791958Z 2025-05-07T20:25:05.3791962Z 2025-05-07T20:25:05.3791966Z 2025-05-07T20:25:05.3791969Z 2025-05-07T20:25:05.3791972Z 2025-05-07T20:25:05.3791976Z 2025-05-07T20:25:05.3792171Z  2025-05-07T20:25:05.3792409Z 2025-05-07T20:25:05.3792422Z 2025-05-07T20:25:05.3792426Z 2025-05-07T20:25:05.3792429Z 2025-05-07T20:25:05.3792433Z 2025-05-07T20:25:05.3792436Z 2025-05-07T20:25:05.3792440Z 2025-05-07T20:25:05.3792443Z 2025-05-07T20:25:05.3792446Z 2025-05-07T20:25:05.3792641Z  2025-05-07T20:25:05.3792874Z 2025-05-07T20:25:05.3792878Z 2025-05-07T20:25:05.3792881Z 2025-05-07T20:25:05.3792885Z 2025-05-07T20:25:05.3792896Z 2025-05-07T20:25:05.3792899Z 2025-05-07T20:25:05.3792903Z 2025-05-07T20:25:05.3792906Z 2025-05-07T20:25:05.3792910Z 2025-05-07T20:25:05.3792913Z 2025-05-07T20:25:05.3793112Z  2025-05-07T20:25:05.3793353Z 2025-05-07T20:25:05.3793356Z 2025-05-07T20:25:05.3793360Z 2025-05-07T20:25:05.3793363Z 2025-05-07T20:25:05.3793367Z 2025-05-07T20:25:05.3793370Z 2025-05-07T20:25:05.3793374Z 2025-05-07T20:25:05.3793377Z 2025-05-07T20:25:05.3793385Z 2025-05-07T20:25:05.3793388Z 2025-05-07T20:25:05.3793392Z 2025-05-07T20:25:05.3793592Z  2025-05-07T20:25:05.3793834Z 
2025-05-07T20:25:05.3793838Z 2025-05-07T20:25:05.3793842Z 2025-05-07T20:25:05.3793845Z 2025-05-07T20:25:05.3793849Z 2025-05-07T20:25:05.3793852Z 2025-05-07T20:25:05.3793855Z 2025-05-07T20:25:05.3793859Z 2025-05-07T20:25:05.3793862Z 2025-05-07T20:25:05.3793866Z 2025-05-07T20:25:05.3793869Z 2025-05-07T20:25:05.3793873Z 2025-05-07T20:25:05.3794077Z  2025-05-07T20:25:05.3794316Z 2025-05-07T20:25:05.3794325Z 2025-05-07T20:25:05.3794328Z 2025-05-07T20:25:05.3794331Z 2025-05-07T20:25:05.3794335Z 2025-05-07T20:25:05.3794338Z 2025-05-07T20:25:05.3794361Z 2025-05-07T20:25:05.3794374Z 2025-05-07T20:25:05.3794377Z 2025-05-07T20:25:05.3794381Z 2025-05-07T20:25:05.3794384Z 2025-05-07T20:25:05.3794388Z 2025-05-07T20:25:05.3794396Z 2025-05-07T20:25:05.3794596Z  2025-05-07T20:25:05.3794828Z 2025-05-07T20:25:05.3794840Z 2025-05-07T20:25:05.3794843Z 2025-05-07T20:25:05.3794847Z 2025-05-07T20:25:05.3794850Z 2025-05-07T20:25:05.3794854Z 2025-05-07T20:25:05.3794857Z 2025-05-07T20:25:05.3794861Z 2025-05-07T20:25:05.3794864Z 2025-05-07T20:25:05.3794868Z 2025-05-07T20:25:05.3794871Z 2025-05-07T20:25:05.3794875Z 2025-05-07T20:25:05.3794878Z 2025-05-07T20:25:05.3794882Z 2025-05-07T20:25:05.3795088Z  2025-05-07T20:25:05.3795340Z 2025-05-07T20:25:05.3795344Z 2025-05-07T20:25:05.3795347Z 2025-05-07T20:25:05.3795351Z 2025-05-07T20:25:05.3795354Z 2025-05-07T20:25:05.3795357Z 2025-05-07T20:25:05.3795361Z 2025-05-07T20:25:05.3795364Z 2025-05-07T20:25:05.3795368Z 2025-05-07T20:25:05.3795371Z 2025-05-07T20:25:05.3795460Z 2025-05-07T20:25:05.3795464Z 2025-05-07T20:25:05.3795467Z 2025-05-07T20:25:05.3795471Z 2025-05-07T20:25:05.3795474Z 2025-05-07T20:25:05.3795693Z  2025-05-07T20:25:05.3795933Z 2025-05-07T20:25:05.3795937Z 2025-05-07T20:25:05.3795940Z 2025-05-07T20:25:05.3795944Z 2025-05-07T20:25:05.3795947Z 2025-05-07T20:25:05.3795951Z 2025-05-07T20:25:05.3795954Z 2025-05-07T20:25:05.3795958Z 2025-05-07T20:25:05.3795969Z 2025-05-07T20:25:05.3795972Z 2025-05-07T20:25:05.3795976Z 2025-05-07T20:25:05.3795979Z 2025-05-07T20:25:05.3795983Z 2025-05-07T20:25:05.3795986Z 2025-05-07T20:25:05.3795989Z 2025-05-07T20:25:05.3796076Z 2025-05-07T20:25:05.3796287Z  2025-05-07T20:25:05.3796538Z 2025-05-07T20:25:05.3796541Z 2025-05-07T20:25:05.3796544Z 2025-05-07T20:25:05.3796548Z 2025-05-07T20:25:05.3796551Z 2025-05-07T20:25:05.3796562Z 2025-05-07T20:25:05.3796565Z 2025-05-07T20:25:05.3796569Z 2025-05-07T20:25:05.3796572Z 2025-05-07T20:25:05.3796576Z 2025-05-07T20:25:05.3796579Z 2025-05-07T20:25:05.3796583Z 2025-05-07T20:25:05.3796586Z 2025-05-07T20:25:05.3796590Z 2025-05-07T20:25:05.3796593Z 2025-05-07T20:25:05.3796597Z 2025-05-07T20:25:05.3796600Z 2025-05-07T20:25:05.3796830Z  2025-05-07T20:25:05.3797073Z 2025-05-07T20:25:05.3797077Z 2025-05-07T20:25:05.3797080Z 2025-05-07T20:25:05.3797083Z 2025-05-07T20:25:05.3797087Z 2025-05-07T20:25:05.3797090Z 2025-05-07T20:25:05.3797094Z 2025-05-07T20:25:05.3797103Z 2025-05-07T20:25:05.3797107Z 2025-05-07T20:25:05.3797110Z 2025-05-07T20:25:05.3797114Z 2025-05-07T20:25:05.3797127Z 2025-05-07T20:25:05.3797131Z 2025-05-07T20:25:05.3797134Z 2025-05-07T20:25:05.3797137Z 2025-05-07T20:25:05.3797141Z 2025-05-07T20:25:05.3797144Z 2025-05-07T20:25:05.3797148Z 2025-05-07T20:25:05.3797376Z  2025-05-07T20:25:05.3797636Z 2025-05-07T20:25:05.3797718Z done 2025-05-07T20:25:05.4794709Z Preparing transaction: \ done 2025-05-07T20:25:06.2344450Z Verifying transaction: / - \ | / - done 2025-05-07T20:25:07.9400342Z Executing transaction: | / - \ | / - \ | / - \ | / - \ | done 2025-05-07T20:25:08.2996878Z [SETUP] Testing 
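The exact command behind this import test is not echoed in the log; a minimal equivalent check against the build_binary environment (a sketch, not the script's actual helper) would be:

    # Verify the pyOpenSSL package installed above is importable in the env
    conda run -n build_binary python -c "import OpenSSL; print(OpenSSL.__version__)"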
2025-05-07T20:25:10.0596901Z [CHECK] Python (sub-)package 'OpenSSL' found ...
2025-05-07T20:25:10.0610518Z [SETUP] Installing libxcrypt ...
2025-05-07T20:25:10.0635696Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y libxcrypt
2025-05-07T20:25:10.9275511Z Channels:
2025-05-07T20:25:10.9275924Z  - conda-forge
2025-05-07T20:25:10.9276228Z Platform: linux-64
2025-05-07T20:25:14.3680634Z Collecting package metadata (repodata.json): done
2025-05-07T20:25:14.7385468Z Solving environment: done
2025-05-07T20:25:14.7750833Z # All requested packages already installed.
2025-05-07T20:25:18.1871424Z [SETUP] Copying over ...
2025-05-07T20:25:18.1872212Z + cp /home/ec2-user/miniconda/envs/build_binary/include/crypt.h /home/ec2-user/miniconda/envs/build_binary/include/python3.12/crypt.h
2025-05-07T20:25:19.8421284Z [SETUP] Installed Python version: Python 3.12.2
2025-05-07T20:25:19.8421748Z [SETUP] Successfully created Conda environment: build_binary
2025-05-07T20:25:19.8456382Z ##[group]Run . $PRELUDE; install_cxx_compiler $BUILD_ENV gcc
2025-05-07T20:25:19.8456868Z . $PRELUDE; install_cxx_compiler $BUILD_ENV gcc
2025-05-07T20:25:19.8470513Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:25:19.8470870Z env:
2025-05-07T20:25:19.8471090Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:25:19.8471622Z BUILD_ENV: build_binary
2025-05-07T20:25:19.8471883Z BUILD_TARGET: genai
2025-05-07T20:25:19.8472118Z BUILD_VARIANT: cuda
2025-05-07T20:25:19.8472369Z BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:25:19.8472634Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:25:19.8472952Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:25:19.8473331Z ##[endgroup]
2025-05-07T20:25:20.1864782Z ################################################################################
2025-05-07T20:25:20.1865251Z # Install C/C++ Compilers
2025-05-07T20:25:20.1865520Z #
2025-05-07T20:25:20.1880840Z # [2025-05-07T20:25:20.187Z] + install_cxx_compiler build_binary gcc
2025-05-07T20:25:20.1881276Z ################################################################################
2025-05-07T20:25:20.1898115Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:25:20.2770330Z [CHECK] Network does not appear to be blocked.
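The "[EXEC] [ATTEMPT 0/3]" prefix on network-bound commands indicates a retry wrapper in .github/scripts/setup_env.bash. A minimal sketch of that pattern, with the attempt counter and limit taken from the log and the back-off delay assumed:

    exec_with_retries () {
      # Retry a command up to max_retries times, echoing each attempt in the
      # same "[EXEC] [ATTEMPT n/3]" style seen above; the real helper may differ.
      local max_retries=3
      local attempt
      for attempt in $(seq 0 "${max_retries}"); do
        echo "[EXEC] [ATTEMPT ${attempt}/${max_retries}] + $*"
        "$@" && return 0
        sleep 2  # assumed pause between attempts; not shown in the log
      done
      return 1
    }

    # Example mirroring the network check above:
    #   exec_with_retries wget -q --timeout 1 pypi.org -O /dev/null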
2025-05-07T20:25:20.2780148Z [INSTALL] Installing GLIBC (architecture = 64) ...
2025-05-07T20:25:20.2800708Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y sysroot_linux-64=2.17
2025-05-07T20:25:21.1508293Z Channels:
2025-05-07T20:25:21.1508566Z  - conda-forge
2025-05-07T20:25:21.1508812Z Platform: linux-64
2025-05-07T20:25:24.5883301Z Collecting package metadata (repodata.json): done
2025-05-07T20:25:24.9598218Z Solving environment: done
2025-05-07T20:25:25.0234160Z ## Package Plan ##
2025-05-07T20:25:25.0234541Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:25:25.0234979Z   added / updated specs:
2025-05-07T20:25:25.0235273Z     - sysroot_linux-64=2.17
2025-05-07T20:25:25.0235592Z The following packages will be downloaded:
2025-05-07T20:25:25.0235953Z     package                        |            build
2025-05-07T20:25:25.0236278Z     ---------------------------|-----------------
2025-05-07T20:25:25.0236717Z     kernel-headers_linux-64-3.10.0 | he073ed8_18      921 KB  conda-forge
2025-05-07T20:25:25.0237221Z     sysroot_linux-64-2.17          | h0157908_18     14.5 MB  conda-forge
2025-05-07T20:25:25.0237645Z     ------------------------------------------------------------
2025-05-07T20:25:25.0237995Z                                                Total:  15.4 MB
2025-05-07T20:25:25.0238350Z The following NEW packages will be INSTALLED:
2025-05-07T20:25:25.0238880Z   kernel-headers_li~ conda-forge/noarch::kernel-headers_linux-64-3.10.0-he073ed8_18
2025-05-07T20:25:25.0239475Z   sysroot_linux-64   conda-forge/noarch::sysroot_linux-64-2.17-h0157908_18
2025-05-07T20:25:25.0239951Z Downloading and Extracting Packages: ...working...
2025-05-07T20:25:25.3674746Z kernel-headers_linux | 921 KB | ########## | 100%
2025-05-07T20:25:26.0839536Z sysroot_linux-64-2.1 | 14.5 MB | ########## | 100%
2025-05-07T20:25:26.0846008Z done
2025-05-07T20:25:26.1850350Z Preparing transaction: done
2025-05-07T20:25:26.3858152Z Verifying transaction: done
2025-05-07T20:25:26.5907753Z Executing transaction: done
2025-05-07T20:25:26.7480893Z [CHECK] LD_LIBRARY_PATH =
2025-05-07T20:25:26.7481281Z [CHECK] CONDA_PREFIX is not set.
2025-05-07T20:25:28.4425969Z [CHECK] libstdc++.so.6 found in CONDA_PREFIX PATH (symbolic link): /home/ec2-user/miniconda/envs/build_binary/lib/libstdc++.so.6
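Pinning sysroot_linux-64=2.17 caps the GLIBC symbol versions the toolchain can link against. One way to confirm the cap held on a built artifact (the .so name below is illustrative, not taken from this log):

    # Symbol versions the artifact requires from GLIBC; with the 2.17 sysroot
    # these should stay at or below GLIBC_2.17:
    objdump -T fbgemm_gpu_py.so | grep -o 'GLIBC_[0-9.]*' | sort -Vu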
2025-05-07T20:25:28.4439908Z [INSTALL] Installing GCC (11.4.0, 64) through Conda ...
2025-05-07T20:25:28.4461248Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y gxx_linux-64=11.4.0
2025-05-07T20:25:29.3339658Z Channels:
2025-05-07T20:25:29.3339894Z  - conda-forge
2025-05-07T20:25:29.3340136Z Platform: linux-64
2025-05-07T20:25:32.7714400Z Collecting package metadata (repodata.json): done
2025-05-07T20:25:33.7516291Z Solving environment: done
2025-05-07T20:25:33.8178963Z ## Package Plan ##
2025-05-07T20:25:33.8179335Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:25:33.8179787Z   added / updated specs:
2025-05-07T20:25:33.8180075Z     - gxx_linux-64=11.4.0
2025-05-07T20:25:33.8180377Z The following packages will be downloaded:
2025-05-07T20:25:33.8180717Z     package                        |            build
2025-05-07T20:25:33.8181052Z     ---------------------------|-----------------
2025-05-07T20:25:33.8181487Z     binutils_impl_linux-64-2.40    | ha1999f0_7        6.0 MB  conda-forge
2025-05-07T20:25:33.8181975Z     binutils_linux-64-2.40         | hb3c18ed_4         28 KB  conda-forge
2025-05-07T20:25:33.8182449Z     gcc_impl_linux-64-11.4.0       | h00c12a0_13      53.0 MB  conda-forge
2025-05-07T20:25:33.8182897Z     gcc_linux-64-11.4.0            | ha077dfb_4         31 KB  conda-forge
2025-05-07T20:25:33.8183347Z     gxx_impl_linux-64-11.4.0       | h634f3ee_13      11.2 MB  conda-forge
2025-05-07T20:25:33.8183788Z     gxx_linux-64-11.4.0            | h35bfe5d_4         29 KB  conda-forge
2025-05-07T20:25:33.8184237Z     ld_impl_linux-64-2.40          | hf3520f5_7        691 KB  conda-forge
2025-05-07T20:25:33.8184719Z     libgcc-devel_linux-64-11.4.0   | h8f596e0_113      2.3 MB  conda-forge
2025-05-07T20:25:33.8185209Z     libsanitizer-11.4.0            | h5763a12_13       3.5 MB  conda-forge
2025-05-07T20:25:33.8185655Z     libstdcxx-15.1.0               | h8f9b012_2        3.7 MB  conda-forge
2025-05-07T20:25:33.8186146Z     libstdcxx-devel_linux-64-11.4.0| h8f596e0_113     11.1 MB  conda-forge
2025-05-07T20:25:33.8186647Z     libstdcxx-ng-15.1.0            | h4852527_2         34 KB  conda-forge
2025-05-07T20:25:33.8187107Z     ------------------------------------------------------------
2025-05-07T20:25:33.8187462Z                                                Total:  91.6 MB
2025-05-07T20:25:33.8187814Z The following NEW packages will be INSTALLED:
2025-05-07T20:25:33.8188581Z   binutils_impl_lin~ conda-forge/linux-64::binutils_impl_linux-64-2.40-ha1999f0_7
2025-05-07T20:25:33.8189165Z   binutils_linux-64  conda-forge/linux-64::binutils_linux-64-2.40-hb3c18ed_4
2025-05-07T20:25:33.8189736Z   gcc_impl_linux-64  conda-forge/linux-64::gcc_impl_linux-64-11.4.0-h00c12a0_13
2025-05-07T20:25:33.8190274Z   gcc_linux-64       conda-forge/linux-64::gcc_linux-64-11.4.0-ha077dfb_4
2025-05-07T20:25:33.8190954Z   gxx_impl_linux-64  conda-forge/linux-64::gxx_impl_linux-64-11.4.0-h634f3ee_13
2025-05-07T20:25:33.8191476Z   gxx_linux-64       conda-forge/linux-64::gxx_linux-64-11.4.0-h35bfe5d_4
2025-05-07T20:25:33.8192026Z   libgcc-devel_linu~ conda-forge/noarch::libgcc-devel_linux-64-11.4.0-h8f596e0_113
2025-05-07T20:25:33.8192611Z   libsanitizer       conda-forge/linux-64::libsanitizer-11.4.0-h5763a12_13
2025-05-07T20:25:33.8193137Z   libstdcxx          conda-forge/linux-64::libstdcxx-15.1.0-h8f9b012_2
2025-05-07T20:25:33.8193698Z   libstdcxx-devel_l~ conda-forge/noarch::libstdcxx-devel_linux-64-11.4.0-h8f596e0_113
2025-05-07T20:25:33.8194209Z The following packages will be UPDATED:
2025-05-07T20:25:33.8194750Z   ld_impl_linux-64   pkgs/main::ld_impl_linux-64-2.40-h12e~ --> conda-forge::ld_impl_linux-64-2.40-hf3520f5_7
2025-05-07T20:25:33.8195493Z   libstdcxx-ng       pkgs/main::libstdcxx-ng-11.2.0-h12345~ --> conda-forge::libstdcxx-ng-15.1.0-h4852527_2
2025-05-07T20:25:33.8196089Z Downloading and Extracting Packages: ...working...
2025-05-07T20:25:34.3805027Z libstdcxx-15.1.0 | 3.7 MB | ########## | 100%
2025-05-07T20:25:34.6597867Z libsanitizer-11.4.0 | 3.5 MB | ########## | 100%
2025-05-07T20:25:34.7579812Z binutils_impl_linux- | 6.0 MB | ########## | 100%
2025-05-07T20:25:34.8097006Z libstdcxx-devel_linu | 11.1 MB | ########## | 100%
2025-05-07T20:25:34.8422456Z ld_impl_linux-64-2.4 | 691 KB | ########## | 100%
2025-05-07T20:25:34.8604370Z libgcc-devel_linux-6 | 2.3 MB | ########## | 100%
2025-05-07T20:25:34.8758744Z libstdcxx-ng-15.1.0 | 34 KB | ########## | 100%
2025-05-07T20:25:34.8801999Z gxx_impl_linux-64-11 | 11.2 MB | ########## | 100%
2025-05-07T20:25:34.8876897Z gcc_linux-64-11.4.0 | 31 KB | ########## | 100%
2025-05-07T20:25:34.9034685Z gxx_linux-64-11.4.0 | 29 KB | ########## | 100%
2025-05-07T20:25:34.9895516Z binutils_linux-64-2. | 28 KB | ########## | 100%
2025-05-07T20:25:35.7550340Z gcc_impl_linux-64-11 | 53.0 MB | ########## | 100%
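The plan above updates libstdcxx-ng to 15.1.0 from conda-forge while the compilers themselves stay at GCC 11.4.0. A quick spot-check (not part of the original script) of which GLIBCXX ABI levels the environment's libstdc++ runtime provides, using the path from the [CHECK] line earlier:

    strings /home/ec2-user/miniconda/envs/build_binary/lib/libstdc++.so.6 | grep -E '^GLIBCXX_[0-9.]+$' | sort -Vu | tail -5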
2025-05-07T20:25:36.3765183Z done
2025-05-07T20:25:36.4765772Z Preparing transaction: done
2025-05-07T20:25:36.9772679Z Verifying transaction: done
2025-05-07T20:25:37.0783043Z Executing transaction: done
2025-05-07T20:25:37.2535924Z [INSTALL] Setting the C/C++ compiler symlinks ...
2025-05-07T20:25:41.2002539Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-cc /home/ec2-user/miniconda/envs/build_binary/bin/cc
2025-05-07T20:25:41.2035585Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-cc /home/ec2-user/miniconda/envs/build_binary/bin/gcc
2025-05-07T20:25:41.2064629Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-c++ /home/ec2-user/miniconda/envs/build_binary/bin/c++
2025-05-07T20:25:41.2095084Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-c++ /home/ec2-user/miniconda/envs/build_binary/bin/g++
2025-05-07T20:25:43.1052204Z /home/ec2-user/miniconda/envs/build_binary/bin/cc
2025-05-07T20:25:43.1702138Z [CHECK] Binary cc found in PATH
2025-05-07T20:25:45.0730530Z /home/ec2-user/miniconda/envs/build_binary/bin/gcc
2025-05-07T20:25:45.1400140Z [CHECK] Binary gcc found in PATH
2025-05-07T20:25:47.0548728Z /home/ec2-user/miniconda/envs/build_binary/bin/c++
2025-05-07T20:25:47.1191187Z [CHECK] Binary c++ found in PATH
2025-05-07T20:25:49.0174508Z /home/ec2-user/miniconda/envs/build_binary/bin/g++
2025-05-07T20:25:49.0800001Z [CHECK] Binary g++ found in PATH
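With the symlinks in place, cc, gcc, c++, and g++ inside the environment all resolve to the conda cross-toolchain binaries. A hedged spot-check, assuming the build_binary env from above (not a command the script itself runs):

    for tool in cc gcc c++ g++; do
      # Print where each alias resolves and the compiler version it reports
      conda run -n build_binary which "${tool}"
      conda run -n build_binary "${tool}" --version | head -1
    done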
2025-05-07T20:25:49.0804781Z [INFO] Printing out all preprocessor defines in the C compiler ...
2025-05-07T20:25:49.0805323Z + conda run -n build_binary cc -dM -E -
#define __DBL_MIN_EXP__ (-1021)
#define __UINT_LEAST16_MAX__ 0xffff
#define __ATOMIC_ACQUIRE 2
#define __FLT128_MAX_10_EXP__ 4932
#define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F
#define __GCC_IEC_559_COMPLEX 2
#define __UINT_LEAST8_TYPE__ unsigned char
#define __SIZEOF_FLOAT80__ 16
#define __INTMAX_C(c) c ## L
#define __CHAR_BIT__ 8
#define __UINT8_MAX__ 0xff
#define __SCHAR_WIDTH__ 8
#define __WINT_MAX__ 0xffffffffU
#define __FLT32_MIN_EXP__ (-125)
#define __ORDER_LITTLE_ENDIAN__ 1234
#define __SIZE_MAX__ 0xffffffffffffffffUL
#define __WCHAR_MAX__ 0x7fffffff
#define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1
#define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1
#define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1
#define __DBL_DENORM_MIN__ ((double)4.94065645841246544176568792868221372e-324L)
#define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1
#define __GCC_ATOMIC_CHAR_LOCK_FREE 2
#define __GCC_IEC_559 2
#define __FLT32X_DECIMAL_DIG__ 17
#define __FLT_EVAL_METHOD__ 0
#define __FLT64_DECIMAL_DIG__ 17
#define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2
#define __UINT_FAST64_MAX__ 0xffffffffffffffffUL
#define __SIG_ATOMIC_TYPE__ int
#define __DBL_MIN_10_EXP__ (-307)
#define __FINITE_MATH_ONLY__ 0
#define __FLT32X_MAX_EXP__ 1024
#define __FLT32_HAS_DENORM__ 1
#define __UINT_FAST8_MAX__ 0xff
#define __FLT32_MAX_10_EXP__ 38
#define __DEC64_MAX_EXP__ 385
#define __INT8_C(c) c
#define __INT_LEAST8_WIDTH__ 8
#define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL
#define __SHRT_MAX__ 0x7fff
#define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L
#define __FLT64X_MAX_10_EXP__ 4932
#define __LDBL_IS_IEC_60559__ 2
#define __FLT64X_HAS_QUIET_NAN__ 1
#define __UINT_LEAST8_MAX__ 0xff
#define __GCC_ATOMIC_BOOL_LOCK_FREE 2
#define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128
#define __UINTMAX_TYPE__ long unsigned int
#define __linux 1
#define __DEC32_EPSILON__ 1E-6DF
#define __FLT_EVAL_METHOD_TS_18661_3__ 0
#define __unix 1
#define __UINT32_MAX__ 0xffffffffU
#define __FLT128_MIN_EXP__ (-16381)
#define __WINT_MIN__ 0U
#define __FLT128_MIN_10_EXP__ (-4931)
#define __FLT32X_IS_IEC_60559__ 2
#define __INT_LEAST16_WIDTH__ 16
#define __SCHAR_MAX__ 0x7f
#define __FLT128_MANT_DIG__ 113
#define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1)
#define __INT64_C(c) c ## L
#define __GCC_ATOMIC_POINTER_LOCK_FREE 2
#define __FLT32X_MANT_DIG__ 53
#define __USER_LABEL_PREFIX__
#define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x
#define __STDC_HOSTED__ 1
#define __DEC64_MIN_EXP__ (-382)
#define __DBL_DIG__ 15
#define __FLT32_DIG__ 6
#define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F
#define __SHRT_WIDTH__ 16
#define __FLT32_IS_IEC_60559__ 2
#define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L
#define __STDC_UTF_16__ 1
#define __DBL_IS_IEC_60559__ 2
#define __DEC32_MAX__ 9.999999E96DF
#define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x
#define __FLT32X_HAS_INFINITY__ 1
#define __INT32_MAX__ 0x7fffffff
#define __unix__ 1
#define __INT_WIDTH__ 32
#define __SIZEOF_LONG__ 8
#define __STDC_IEC_559__ 1
#define __STDC_ISO_10646__ 201103L
#define __UINT16_C(c) c
#define __DECIMAL_DIG__ 21
#define __STDC_IEC_559_COMPLEX__ 1
#define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64
#define __gnu_linux__ 1
#define __FLT128_IS_IEC_60559__ 2
#define __FLT64X_MIN_10_EXP__ (-4931)
#define __LDBL_HAS_QUIET_NAN__ 1
#define __FLT64_MANT_DIG__ 53
#define __FLT64X_MANT_DIG__ 64
#define __GNUC__ 11
#define __pie__ 2
#define __MMX__ 1
#define __FLT_HAS_DENORM__ 1
#define __SIZEOF_LONG_DOUBLE__ 16
#define __BIGGEST_ALIGNMENT__ 16
#define __FLT64_MAX_10_EXP__ 308
#define __DBL_MAX__ ((double)1.79769313486231570814527423731704357e+308L)
#define __INT_FAST32_MAX__ 0x7fffffffffffffffL
#define __DBL_HAS_INFINITY__ 1
#define __SIZEOF_FLOAT__ 4
#define __HAVE_SPECULATION_SAFE_VALUE 1
#define __DEC32_MIN_EXP__ (-94)
#define __INTPTR_WIDTH__ 64
#define __FLT64X_HAS_INFINITY__ 1
#define __UINT_LEAST32_MAX__ 0xffffffffU
#define __FLT32X_HAS_DENORM__ 1
#define __INT_FAST16_TYPE__ long int
#define __MMX_WITH_SSE__ 1
#define __LDBL_HAS_DENORM__ 1
#define __FLT128_HAS_INFINITY__ 1
#define __DEC32_MIN__ 1E-95DF
#define __DBL_MAX_EXP__ 1024
#define __WCHAR_WIDTH__ 32
#define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32
#define __DEC128_EPSILON__ 1E-33DL
#define __SSE2_MATH__ 1
#define __ATOMIC_HLE_RELEASE 131072
#define __PTRDIFF_MAX__ 0x7fffffffffffffffL
#define __amd64 1
#define __STDC_NO_THREADS__ 1
#define __ATOMIC_HLE_ACQUIRE 65536
#define __LONG_LONG_MAX__ 0x7fffffffffffffffLL
#define __SIZEOF_SIZE_T__ 8
#define __FLT64X_MIN_EXP__ (-16381)
#define __SIZEOF_WINT_T__ 4
#define __LONG_LONG_WIDTH__ 64
#define __FLT32_MAX_EXP__ 128
#define __GXX_ABI_VERSION 1016
#define __FLT_MIN_EXP__ (-125)
#define __GCC_HAVE_DWARF2_CFI_ASM 1
#define __INT16_MAX__ 0x7fff
#define __x86_64 1
#define __INT_FAST64_TYPE__ long int
#define __FLT64_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F64
#define __DBL_MIN__ ((double)2.22507385850720138309023271733240406e-308L)
#define __FLT128_EPSILON__ 1.92592994438723585305597794258492732e-34F128
#define __FLT64X_NORM_MAX__ 1.18973149535723176502126385303097021e+4932F64x
#define __SIZEOF_POINTER__ 8
#define __LP64__ 1
#define __DBL_HAS_QUIET_NAN__ 1
#define __FLT32X_EPSILON__ 2.22044604925031308084726333618164062e-16F32x
#define __DECIMAL_BID_FORMAT__ 1
#define __FLT64_MIN_EXP__ (-1021)
#define __FLT64_MIN_10_EXP__ (-307)
#define __FLT64X_DECIMAL_DIG__ 21
#define __DEC128_MIN__ 1E-6143DL
#define __REGISTER_PREFIX__
#define __UINT16_MAX__ 0xffff
#define __DBL_HAS_DENORM__ 1
#define __LDBL_HAS_INFINITY__ 1
#define __FLT32_MIN__ 1.17549435082228750796873653722224568e-38F32
#define __UINT8_TYPE__ unsigned char
#define __FLT_DIG__ 6
#define __NO_INLINE__ 1
#define __DEC_EVAL_METHOD__ 2
#define __DEC128_MAX__ 9.999999999999999999999999999999999E6144DL
#define __FLT_MANT_DIG__ 24
#define __LDBL_DECIMAL_DIG__ 21
#define __VERSION__ "11.4.0"
#define __UINT64_C(c) c ## UL
#define _STDC_PREDEF_H 1
#define __INT_LEAST32_MAX__ 0x7fffffff
#define __GCC_ATOMIC_INT_LOCK_FREE 2
#define __FLT128_MAX_EXP__ 16384
#define __FLT32_MANT_DIG__ 24
#define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__
#define __FLT128_HAS_DENORM__ 1
#define __FLT32_DECIMAL_DIG__ 9
#define __FLT128_DIG__ 33
#define __INT32_C(c) c
#define __DEC64_EPSILON__ 1E-15DD
#define __ORDER_PDP_ENDIAN__ 3412
#define __DEC128_MIN_EXP__ (-6142)
#define __INT_FAST32_TYPE__ long int
#define __UINT_LEAST16_TYPE__ short unsigned int
#define unix 1
#define __SIZE_TYPE__ long unsigned int
#define __UINT64_MAX__ 0xffffffffffffffffUL
#define __FLT_IS_IEC_60559__ 2
#define __GNUC_WIDE_EXECUTION_CHARSET_NAME "UTF-32LE"
#define __FLT64X_DIG__ 18
#define __INT8_TYPE__ signed char
#define __ELF__ 1
#define __GCC_ASM_FLAG_OUTPUTS__ 1
#define __UINT32_TYPE__ unsigned int
#define __FLT_RADIX__ 2
#define __INT_LEAST16_TYPE__ short int
#define __LDBL_EPSILON__ 1.08420217248550443400745280086994171e-19L
#define __UINTMAX_C(c) c ## UL
#define __SSE_MATH__ 1
#define __k8 1
#define __FLT32X_MIN__ 2.22507385850720138309023271733240406e-308F32x
#define __SIG_ATOMIC_MAX__ 0x7fffffff
#define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 2
#define __SIZEOF_PTRDIFF_T__ 8
#define __LDBL_DIG__ 18
#define __FLT64_IS_IEC_60559__ 2
#define __x86_64__ 1
#define __FLT32X_MIN_EXP__ (-1021)
#define __DEC32_SUBNORMAL_MIN__ 0.000001E-95DF
#define __INT_FAST16_MAX__ 0x7fffffffffffffffL
#define __FLT64_DIG__ 15
#define __UINT_FAST32_MAX__ 0xffffffffffffffffUL
#define __UINT_LEAST64_TYPE__ long unsigned int
#define __FLT_HAS_QUIET_NAN__ 1
#define __FLT_MAX_10_EXP__ 38
#define __LONG_MAX__ 0x7fffffffffffffffL
#define __FLT64X_HAS_DENORM__ 1
#define __DEC128_SUBNORMAL_MIN__ 0.000000000000000000000000000000001E-6143DL
#define __FLT_HAS_INFINITY__ 1
#define __GNUC_EXECUTION_CHARSET_NAME "UTF-8"
#define __UINT_FAST16_TYPE__ long unsigned int
#define __DEC64_MAX__ 9.999999999999999E384DD
#define __INT_FAST32_WIDTH__ 64
#define __CHAR16_TYPE__ short unsigned int
#define __PRAGMA_REDEFINE_EXTNAME 1
#define __SIZE_WIDTH__ 64
#define __SEG_FS 1
#define __INT_LEAST16_MAX__ 0x7fff
#define __DEC64_MANT_DIG__ 16
#define __INT64_MAX__ 0x7fffffffffffffffL
#define __SEG_GS 1
#define __FLT32_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F32
#define __SIG_ATOMIC_WIDTH__ 32
#define __INT_LEAST64_TYPE__ long int
#define __INT16_TYPE__ short int
#define __INT_LEAST8_TYPE__ signed char
#define __STDC_VERSION__ 201710L
#define __SIZEOF_INT__ 4
#define __DEC32_MAX_EXP__ 97
#define __INT_FAST8_MAX__ 0x7f
#define __FLT128_MAX__ 1.18973149535723176508575932662800702e+4932F128
#define __INTPTR_MAX__ 0x7fffffffffffffffL
#define linux 1
#define __FLT64_HAS_QUIET_NAN__ 1
#define __FLT32_MIN_10_EXP__ (-37)
#define __FLT32X_DIG__ 15
#define __PTRDIFF_WIDTH__ 64
#define __LDBL_MANT_DIG__ 64
#define __FLT64_HAS_INFINITY__ 1
#define __FLT64X_MAX__ 1.18973149535723176502126385303097021e+4932F64x
#define __SIG_ATOMIC_MIN__ (-__SIG_ATOMIC_MAX__ - 1)
#define __code_model_small__ 1
#define __GCC_ATOMIC_LONG_LOCK_FREE 2
#define __DEC32_MANT_DIG__ 7
#define __k8__ 1
#define __INTPTR_TYPE__ long int
#define __UINT16_TYPE__ short unsigned int
#define __WCHAR_TYPE__ int
#define __pic__ 2
#define __UINTPTR_MAX__ 0xffffffffffffffffUL
#define __INT_FAST64_WIDTH__ 64
#define __INT_FAST64_MAX__ 0x7fffffffffffffffL
#define __GCC_ATOMIC_TEST_AND_SET_TRUEVAL 1
#define __FLT_NORM_MAX__ 3.40282346638528859811704183484516925e+38F
#define __FLT32_HAS_INFINITY__ 1
#define __FLT64X_MAX_EXP__ 16384
#define __UINT_FAST64_TYPE__ long unsigned int
#define __INT_MAX__ 0x7fffffff
#define __linux__ 1
#define __INT64_TYPE__ long int
#define __FLT_MAX_EXP__ 128
#define __ORDER_BIG_ENDIAN__ 4321
#define __DBL_MANT_DIG__ 53 2025-05-07T20:25:50.9991067Z #define __SIZEOF_FLOAT128__ 16 2025-05-07T20:25:50.9991360Z #define __INT_LEAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:50.9991703Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:25:50.9992012Z #define __DEC64_MIN__ 1E-383DD 2025-05-07T20:25:50.9992281Z #define __WINT_TYPE__ unsigned int 2025-05-07T20:25:50.9992588Z #define __UINT_LEAST32_TYPE__ unsigned int 2025-05-07T20:25:50.9992901Z #define __SIZEOF_SHORT__ 2 2025-05-07T20:25:50.9993250Z #define __FLT32_NORM_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:50.9993614Z #define __SSE__ 1 2025-05-07T20:25:50.9993851Z #define __LDBL_MIN_EXP__ (-16381) 2025-05-07T20:25:50.9994202Z #define __FLT64_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:25:50.9994548Z #define __amd64__ 1 2025-05-07T20:25:50.9994784Z #define __WINT_WIDTH__ 32 2025-05-07T20:25:50.9995046Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:25:50.9995315Z #define __INT_LEAST64_WIDTH__ 64 2025-05-07T20:25:50.9995591Z #define __LDBL_MAX_EXP__ 16384 2025-05-07T20:25:50.9995863Z #define __FLT32X_MAX_10_EXP__ 308 2025-05-07T20:25:50.9996138Z #define __SIZEOF_INT128__ 16 2025-05-07T20:25:50.9996403Z #define __FLT64X_IS_IEC_60559__ 2 2025-05-07T20:25:50.9996685Z #define __LDBL_MAX_10_EXP__ 4932 2025-05-07T20:25:50.9996955Z #define __ATOMIC_RELAXED 0 2025-05-07T20:25:50.9997310Z #define __DBL_EPSILON__ ((double)2.22044604925031308084726333618164062e-16L) 2025-05-07T20:25:50.9997794Z #define __FLT128_MIN__ 3.36210314311209350626267781732175260e-4932F128 2025-05-07T20:25:50.9998161Z #define _LP64 1 2025-05-07T20:25:50.9998382Z #define __UINT8_C(c) c 2025-05-07T20:25:50.9998632Z #define __FLT64_MAX_EXP__ 1024 2025-05-07T20:25:50.9998908Z #define __INT_LEAST32_TYPE__ int 2025-05-07T20:25:50.9999180Z #define __SIZEOF_WCHAR_T__ 4 2025-05-07T20:25:50.9999466Z #define __UINT64_TYPE__ long unsigned int 2025-05-07T20:25:50.9999781Z #define __GNUC_PATCHLEVEL__ 0 2025-05-07T20:25:51.0000141Z #define __FLT128_NORM_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:25:51.0000624Z #define __FLT64_NORM_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:25:51.0001013Z #define __FLT128_HAS_QUIET_NAN__ 1 2025-05-07T20:25:51.0001314Z #define __INTMAX_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:51.0001637Z #define __INT_FAST8_TYPE__ signed char 2025-05-07T20:25:51.0002020Z #define __FLT64X_MIN__ 3.36210314311209350626267781732175260e-4932F64x 2025-05-07T20:25:51.0002411Z #define __GNUC_STDC_INLINE__ 1 2025-05-07T20:25:51.0002679Z #define __FLT64_HAS_DENORM__ 1 2025-05-07T20:25:51.0003028Z #define __FLT32_EPSILON__ 1.19209289550781250000000000000000000e-7F32 2025-05-07T20:25:51.0003410Z #define __DBL_DECIMAL_DIG__ 17 2025-05-07T20:25:51.0003675Z #define __STDC_UTF_32__ 1 2025-05-07T20:25:51.0003940Z #define __INT_FAST8_WIDTH__ 8 2025-05-07T20:25:51.0004203Z #define __FXSR__ 1 2025-05-07T20:25:51.0004505Z #define __FLT32X_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:51.0004969Z #define __DBL_NORM_MAX__ ((double)1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:51.0014926Z #define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:51.0015291Z #define __INTMAX_WIDTH__ 64 2025-05-07T20:25:51.0015549Z #define __UINT32_C(c) c ## U 2025-05-07T20:25:51.0015883Z #define __FLT_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F 2025-05-07T20:25:51.0016253Z #define __INT8_MAX__ 0x7f 2025-05-07T20:25:51.0016630Z #define __LONG_WIDTH__ 
64 2025-05-07T20:25:51.0016867Z #define __PIC__ 2 2025-05-07T20:25:51.0017116Z #define __UINT_FAST32_TYPE__ long unsigned int 2025-05-07T20:25:51.0017519Z #define __FLT32X_NORM_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:51.0017910Z #define __CHAR32_TYPE__ unsigned int 2025-05-07T20:25:51.0018239Z #define __FLT_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:25:51.0018649Z #define __SSE2__ 1 2025-05-07T20:25:51.0018866Z #define __INT32_TYPE__ int 2025-05-07T20:25:51.0019107Z #define __SIZEOF_DOUBLE__ 8 2025-05-07T20:25:51.0019360Z #define __FLT_MIN_10_EXP__ (-37) 2025-05-07T20:25:51.0019692Z #define __FLT64_MIN__ 2.22507385850720138309023271733240406e-308F64 2025-05-07T20:25:51.0020047Z #define __INT_LEAST32_WIDTH__ 32 2025-05-07T20:25:51.0020312Z #define __INTMAX_TYPE__ long int 2025-05-07T20:25:51.0020575Z #define __DEC128_MAX_EXP__ 6145 2025-05-07T20:25:51.0020833Z #define __FLT32X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:51.0021111Z #define __ATOMIC_CONSUME 1 2025-05-07T20:25:51.0021361Z #define __GNUC_MINOR__ 4 2025-05-07T20:25:51.0021613Z #define __INT_FAST16_WIDTH__ 64 2025-05-07T20:25:51.0021903Z #define __UINTMAX_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:51.0022212Z #define __PIE__ 2 2025-05-07T20:25:51.0022543Z #define __FLT32X_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F32x 2025-05-07T20:25:51.0022960Z #define __DBL_MAX_10_EXP__ 308 2025-05-07T20:25:51.0023318Z #define __LDBL_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951L 2025-05-07T20:25:51.0023697Z #define __INT16_C(c) c 2025-05-07T20:25:51.0023919Z #define __STDC__ 1 2025-05-07T20:25:51.0024156Z #define __PTRDIFF_TYPE__ long int 2025-05-07T20:25:51.0024438Z #define __ATOMIC_SEQ_CST 5 2025-05-07T20:25:51.0024696Z #define __FLT32X_MIN_10_EXP__ (-307) 2025-05-07T20:25:51.0025003Z #define __UINTPTR_TYPE__ long unsigned int 2025-05-07T20:25:51.0025363Z #define __DEC64_SUBNORMAL_MIN__ 0.000000000000001E-383DD 2025-05-07T20:25:51.0026022Z #define __DEC128_MANT_DIG__ 34 2025-05-07T20:25:51.0026302Z #define __LDBL_MIN_10_EXP__ (-4931) 2025-05-07T20:25:51.0026591Z #define __SIZEOF_LONG_LONG__ 8 2025-05-07T20:25:51.0026864Z #define __FLT128_DECIMAL_DIG__ 36 2025-05-07T20:25:51.0027149Z #define __GCC_ATOMIC_LLONG_LOCK_FREE 2 2025-05-07T20:25:51.0027447Z #define __FLT32_HAS_QUIET_NAN__ 1 2025-05-07T20:25:51.0027726Z #define __FLT_DECIMAL_DIG__ 9 2025-05-07T20:25:51.0028029Z #define __UINT_FAST16_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:51.0028433Z #define __LDBL_NORM_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:51.0028820Z #define __GCC_ATOMIC_SHORT_LOCK_FREE 2 2025-05-07T20:25:51.0029128Z #define __UINT_FAST8_TYPE__ unsigned char 2025-05-07T20:25:51.0029439Z #define __ATOMIC_ACQ_REL 4 2025-05-07T20:25:51.0029700Z #define __ATOMIC_RELEASE 3 2025-05-07T20:25:51.0029862Z 2025-05-07T20:25:51.0580865Z 2025-05-07T20:25:51.0581247Z [INFO] Printing out all preprocessor defines in the C++ compiler ... 
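[NOTE] The macro dumps in this step come from GCC's preprocessor: -dM emits a #define line for every macro defined after preprocessing, -E stops after the preprocessing stage, and -x c++ - reads an (empty) translation unit from stdin. The same trick is reused later in this step to read __STDC_VERSION__ (201710L, i.e. C17) and __cplusplus (201703L, i.e. C++17). A minimal sketch of the check, assuming only that a GCC-compatible cc/c++ is on PATH:

    # Dump every predefined macro of the C++ front end; an empty program on stdin suffices.
    c++ -dM -E -x c++ - < /dev/null
    # Same idea for the C front end:
    cc -dM -E - < /dev/null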
2025-05-07T20:25:51.0581922Z + conda run -n build_binary c++ -dM -E -x c++ - 2025-05-07T20:25:51.0582281Z 2025-05-07T20:25:52.9725962Z #define __DBL_MIN_EXP__ (-1021) 2025-05-07T20:25:52.9726465Z #define __cpp_attributes 200809L 2025-05-07T20:25:52.9726975Z #define __cpp_nontype_template_parameter_auto 201606L 2025-05-07T20:25:52.9727511Z #define __UINT_LEAST16_MAX__ 0xffff 2025-05-07T20:25:52.9727932Z #define __ATOMIC_ACQUIRE 2 2025-05-07T20:25:52.9728345Z #define __FLT128_MAX_10_EXP__ 4932 2025-05-07T20:25:52.9728751Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F 2025-05-07T20:25:52.9729117Z #define __GCC_IEC_559_COMPLEX 2 2025-05-07T20:25:52.9729413Z #define __cpp_aggregate_nsdmi 201304L 2025-05-07T20:25:52.9729736Z #define __UINT_LEAST8_TYPE__ unsigned char 2025-05-07T20:25:52.9730051Z #define __SIZEOF_FLOAT80__ 16 2025-05-07T20:25:52.9730332Z #define __INTMAX_C(c) c ## L 2025-05-07T20:25:52.9730605Z #define __CHAR_BIT__ 8 2025-05-07T20:25:52.9730847Z #define __UINT8_MAX__ 0xff 2025-05-07T20:25:52.9731112Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:25:52.9733292Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:25:52.9733584Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:25:52.9733884Z #define __cpp_static_assert 201411L 2025-05-07T20:25:52.9734195Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:25:52.9734506Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:52.9734981Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:25:52.9735462Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:25:52.9735797Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:25:52.9736139Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:25:52.9736562Z #define __DBL_DENORM_MIN__ double(4.94065645841246544176568792868221372e-324L) 2025-05-07T20:25:52.9736989Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:25:52.9737309Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:25:52.9737604Z #define __GCC_IEC_559 2 2025-05-07T20:25:52.9737863Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:25:52.9738143Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:25:52.9738438Z #define __cpp_binary_literals 201304L 2025-05-07T20:25:52.9738752Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:25:52.9739053Z #define __cpp_noexcept_function_type 201510L 2025-05-07T20:25:52.9739392Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:25:52.9739727Z #define __cpp_variadic_templates 200704L 2025-05-07T20:25:52.9740073Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:52.9740423Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:25:52.9740710Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:25:52.9741002Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:25:52.9741284Z #define __cpp_variable_templates 201304L 2025-05-07T20:25:52.9741602Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:25:52.9741886Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:25:52.9742179Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:25:52.9742491Z #define __cpp_rvalue_reference 200610L 2025-05-07T20:25:52.9742920Z #define __cpp_nested_namespace_definitions 201411L 2025-05-07T20:25:52.9743268Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:25:52.9743529Z #define __INT8_C(c) c 2025-05-07T20:25:52.9743775Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:25:52.9744053Z #define __cpp_variadic_using 201611L 2025-05-07T20:25:52.9744397Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:52.9744739Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:25:52.9745030Z #define __cpp_capture_star_this 201603L 
2025-05-07T20:25:52.9745334Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:25:52.9745666Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:52.9746035Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:25:52.9746332Z #define __cpp_if_constexpr 201606L 2025-05-07T20:25:52.9746632Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:25:52.9746911Z #define __FLT64X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:52.9747198Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:25:52.9747494Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:25:52.9747912Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:25:52.9748335Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:25:52.9748643Z #define __linux 1 2025-05-07T20:25:52.9748884Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:25:52.9749183Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 2025-05-07T20:25:52.9749467Z #define __unix 1 2025-05-07T20:25:52.9749709Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:25:52.9750014Z #define __GXX_EXPERIMENTAL_CXX0X__ 1 2025-05-07T20:25:52.9750312Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:25:52.9750600Z #define __WINT_MIN__ 0U 2025-05-07T20:25:52.9750856Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:25:52.9751144Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:25:52.9751435Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:25:52.9751713Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:25:52.9751969Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:25:52.9752265Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:25:52.9752582Z #define __INT64_C(c) c ## L 2025-05-07T20:25:52.9752948Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:25:52.9753265Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:25:52.9753555Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:25:52.9753871Z #define __cpp_aligned_new 201606L 2025-05-07T20:25:52.9754159Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:25:52.9754439Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:25:52.9754880Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:25:52.9755269Z #define __STDC_HOSTED__ 1 2025-05-07T20:25:52.9755534Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:25:52.9755825Z #define __cpp_decltype_auto 201304L 2025-05-07T20:25:52.9756110Z #define __DBL_DIG__ 15 2025-05-07T20:25:52.9756347Z #define __FLT32_DIG__ 6 2025-05-07T20:25:52.9756658Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:25:52.9757010Z #define __GXX_WEAK__ 1 2025-05-07T20:25:52.9757254Z #define __SHRT_WIDTH__ 16 2025-05-07T20:25:52.9757523Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:25:52.9757857Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:25:52.9758223Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:25:52.9758499Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:25:52.9758813Z #define __cpp_threadsafe_static_init 200806L 2025-05-07T20:25:52.9759158Z #define __cpp_enumerator_attributes 201411L 2025-05-07T20:25:52.9759590Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:25:52.9760013Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:25:52.9760299Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:25:52.9760570Z #define __unix__ 1 2025-05-07T20:25:52.9760805Z #define __INT_WIDTH__ 32 2025-05-07T20:25:52.9761054Z #define __SIZEOF_LONG__ 8 2025-05-07T20:25:52.9761312Z #define __STDC_IEC_559__ 1 2025-05-07T20:25:52.9761578Z #define __STDC_ISO_10646__ 201103L 
2025-05-07T20:25:52.9761851Z #define __UINT16_C(c) c 2025-05-07T20:25:52.9762097Z #define __DECIMAL_DIG__ 21 2025-05-07T20:25:52.9762374Z #define __STDC_IEC_559_COMPLEX__ 1 2025-05-07T20:25:52.9762749Z #define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64 2025-05-07T20:25:52.9763123Z #define __gnu_linux__ 1 2025-05-07T20:25:52.9763373Z #define __INT16_MAX__ 0x7fff 2025-05-07T20:25:52.9763651Z #define __FLT64_MIN_EXP__ (-1021) 2025-05-07T20:25:52.9763940Z #define __FLT64X_MIN_10_EXP__ (-4931) 2025-05-07T20:25:52.9764253Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:52.9764540Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:25:52.9764806Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:25:52.9765073Z #define __GNUC__ 11 2025-05-07T20:25:52.9765302Z #define __GXX_RTTI 1 2025-05-07T20:25:52.9765530Z #define __pie__ 2 2025-05-07T20:25:52.9765752Z #define __MMX__ 1 2025-05-07T20:25:52.9765984Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:25:52.9766254Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:25:52.9766562Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:25:52.9766838Z #define __STDC_UTF_16__ 1 2025-05-07T20:25:52.9767095Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:25:52.9767424Z #define __cpp_delegating_constructors 200604L 2025-05-07T20:25:52.9767750Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:25:52.9768106Z #define __DBL_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:52.9768491Z #define __cpp_raw_strings 200710L 2025-05-07T20:25:52.9768798Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:52.9769130Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:25:52.9769401Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:25:52.9769670Z #define __HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:25:52.9769987Z #define __cpp_fold_expressions 201603L 2025-05-07T20:25:52.9770290Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:25:52.9770556Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:25:52.9770821Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:25:52.9771115Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:25:52.9771418Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:25:52.9771790Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:25:52.9772076Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:25:52.9772333Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:25:52.9772620Z #define __cplusplus 201703L 2025-05-07T20:25:52.9772905Z #define __cpp_ref_qualifiers 200710L 2025-05-07T20:25:52.9773192Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:25:52.9773442Z #define __DEPRECATED 1 2025-05-07T20:25:52.9773775Z #define __cpp_rvalue_references 200610L 2025-05-07T20:25:52.9774073Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:25:52.9774327Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:25:52.9774705Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:52.9775066Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:25:52.9775328Z #define __SSE2_MATH__ 1 2025-05-07T20:25:52.9775576Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:25:52.9775876Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:52.9776168Z #define __amd64 1 2025-05-07T20:25:52.9776389Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:25:52.9776662Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:25:52.9776930Z #define __GNUG__ 11 2025-05-07T20:25:52.9777176Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:25:52.9777491Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:25:52.9777744Z #define __cpp_nsdmi 200809L 2025-05-07T20:25:52.9777996Z #define __FLT64X_MIN_EXP__ (-16381) 
2025-05-07T20:25:52.9778278Z #define __SIZEOF_WINT_T__ 4 2025-05-07T20:25:52.9778544Z #define __LONG_LONG_WIDTH__ 64 2025-05-07T20:25:52.9778815Z #define __cpp_initializer_lists 200806L 2025-05-07T20:25:52.9779112Z #define __FLT32_MAX_EXP__ 128 2025-05-07T20:25:52.9779376Z #define __cpp_hex_float 201603L 2025-05-07T20:25:52.9779635Z #define __GXX_ABI_VERSION 1016 2025-05-07T20:25:52.9779903Z #define __FLT128_HAS_INFINITY__ 1 2025-05-07T20:25:52.9780179Z #define __FLT_MIN_EXP__ (-125) 2025-05-07T20:25:52.9780439Z #define __GCC_HAVE_DWARF2_CFI_ASM 1 2025-05-07T20:25:52.9780710Z #define __x86_64 1 2025-05-07T20:25:52.9780940Z #define __cpp_lambdas 200907L 2025-05-07T20:25:52.9781216Z #define __INT_FAST64_TYPE__ long int 2025-05-07T20:25:52.9781581Z #define __FLT64_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F64 2025-05-07T20:25:52.9781975Z #define __cpp_template_auto 201606L 2025-05-07T20:25:52.9782335Z #define __DBL_MIN__ double(2.22507385850720138309023271733240406e-308L) 2025-05-07T20:25:52.9782840Z #define __FLT128_EPSILON__ 1.92592994438723585305597794258492732e-34F128 2025-05-07T20:25:52.9783318Z #define __FLT64X_NORM_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:25:52.9783713Z #define __SIZEOF_POINTER__ 8 2025-05-07T20:25:52.9783962Z #define __LP64__ 1 2025-05-07T20:25:52.9784191Z #define __DBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:52.9784539Z #define __FLT32X_EPSILON__ 2.22044604925031308084726333618164062e-16F32x 2025-05-07T20:25:52.9784925Z #define __DECIMAL_BID_FORMAT__ 1 2025-05-07T20:25:52.9785196Z #define __FLT64_MIN_10_EXP__ (-307) 2025-05-07T20:25:52.9785483Z #define __FLT64X_DECIMAL_DIG__ 21 2025-05-07T20:25:52.9785768Z #define __DEC128_MIN__ 1E-6143DL 2025-05-07T20:25:52.9786032Z #define __REGISTER_PREFIX__ 2025-05-07T20:25:52.9786293Z #define __UINT16_MAX__ 0xffff 2025-05-07T20:25:52.9786562Z #define __LDBL_HAS_INFINITY__ 1 2025-05-07T20:25:52.9786886Z #define __FLT32_MIN__ 1.17549435082228750796873653722224568e-38F32 2025-05-07T20:25:52.9787248Z #define __UINT8_TYPE__ unsigned char 2025-05-07T20:25:52.9787528Z #define __FLT_DIG__ 6 2025-05-07T20:25:52.9787751Z #define __NO_INLINE__ 1 2025-05-07T20:25:52.9787991Z #define __DEC_EVAL_METHOD__ 2 2025-05-07T20:25:52.9788314Z #define __DEC128_MAX__ 9.999999999999999999999999999999999E6144DL 2025-05-07T20:25:52.9788661Z #define __FLT_MANT_DIG__ 24 2025-05-07T20:25:52.9788915Z #define __LDBL_DECIMAL_DIG__ 21 2025-05-07T20:25:52.9789184Z #define __VERSION__ "11.4.0" 2025-05-07T20:25:52.9789442Z #define __UINT64_C(c) c ## UL 2025-05-07T20:25:52.9789708Z #define __cpp_unicode_characters 201411L 2025-05-07T20:25:52.9790006Z #define _STDC_PREDEF_H 1 2025-05-07T20:25:52.9790393Z #define __INT_LEAST32_MAX__ 0x7fffffff 2025-05-07T20:25:52.9790684Z #define __GCC_ATOMIC_INT_LOCK_FREE 2 2025-05-07T20:25:52.9790973Z #define __FLT128_MAX_EXP__ 16384 2025-05-07T20:25:52.9791243Z #define __FLT32_MANT_DIG__ 24 2025-05-07T20:25:52.9791537Z #define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:52.9791882Z #define __cpp_aggregate_bases 201603L 2025-05-07T20:25:52.9792287Z #define __FLT128_HAS_DENORM__ 1 2025-05-07T20:25:52.9792550Z #define __FLT32_DECIMAL_DIG__ 9 2025-05-07T20:25:52.9792813Z #define __FLT128_DIG__ 33 2025-05-07T20:25:52.9793060Z #define __INT32_C(c) c 2025-05-07T20:25:52.9793296Z #define __DEC64_EPSILON__ 1E-15DD 2025-05-07T20:25:52.9793577Z #define __ORDER_PDP_ENDIAN__ 3412 2025-05-07T20:25:52.9793863Z #define __DEC128_MIN_EXP__ (-6142) 2025-05-07T20:25:52.9794146Z #define 
__INT_FAST32_TYPE__ long int 2025-05-07T20:25:52.9794462Z #define __UINT_LEAST16_TYPE__ short unsigned int 2025-05-07T20:25:52.9794780Z #define unix 1 2025-05-07T20:25:52.9794999Z #define __DBL_HAS_DENORM__ 1 2025-05-07T20:25:52.9795257Z #define __cpp_rtti 199711L 2025-05-07T20:25:52.9795522Z #define __SIZE_TYPE__ long unsigned int 2025-05-07T20:25:52.9795839Z #define __UINT64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:52.9796142Z #define __FLT_IS_IEC_60559__ 2 2025-05-07T20:25:52.9796453Z #define __GNUC_WIDE_EXECUTION_CHARSET_NAME "UTF-32LE" 2025-05-07T20:25:52.9796785Z #define __FLT64X_DIG__ 18 2025-05-07T20:25:52.9797037Z #define __INT8_TYPE__ signed char 2025-05-07T20:25:52.9797327Z #define __cpp_digit_separators 201309L 2025-05-07T20:25:52.9797608Z #define __ELF__ 1 2025-05-07T20:25:52.9797832Z #define __GCC_ASM_FLAG_OUTPUTS__ 1 2025-05-07T20:25:52.9798119Z #define __UINT32_TYPE__ unsigned int 2025-05-07T20:25:52.9798400Z #define __FLT_RADIX__ 2 2025-05-07T20:25:52.9798646Z #define __INT_LEAST16_TYPE__ short int 2025-05-07T20:25:52.9799000Z #define __LDBL_EPSILON__ 1.08420217248550443400745280086994171e-19L 2025-05-07T20:25:52.9799371Z #define __UINTMAX_C(c) c ## UL 2025-05-07T20:25:52.9799651Z #define __GLIBCXX_BITSIZE_INT_N_0 128 2025-05-07T20:25:52.9799918Z #define __k8 1 2025-05-07T20:25:52.9800211Z #define __FLT32X_MIN__ 2.22507385850720138309023271733240406e-308F32x 2025-05-07T20:25:52.9800599Z #define __SIG_ATOMIC_MAX__ 0x7fffffff 2025-05-07T20:25:52.9800890Z #define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 2 2025-05-07T20:25:52.9801193Z #define __SIZEOF_PTRDIFF_T__ 8 2025-05-07T20:25:52.9801456Z #define __LDBL_DIG__ 18 2025-05-07T20:25:52.9801689Z #define __FLT64_IS_IEC_60559__ 2 2025-05-07T20:25:52.9801947Z #define __x86_64__ 1 2025-05-07T20:25:52.9802184Z #define __FLT32X_MIN_EXP__ (-1021) 2025-05-07T20:25:52.9802479Z #define __DEC32_SUBNORMAL_MIN__ 0.000001E-95DF 2025-05-07T20:25:52.9802820Z #define __INT_FAST16_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:52.9803129Z #define __FLT64_DIG__ 15 2025-05-07T20:25:52.9803407Z #define __UINT_FAST32_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:52.9803751Z #define __UINT_LEAST64_TYPE__ long unsigned int 2025-05-07T20:25:52.9804071Z #define __FLT_HAS_QUIET_NAN__ 1 2025-05-07T20:25:52.9804340Z #define __FLT_MAX_10_EXP__ 38 2025-05-07T20:25:52.9804612Z #define __LONG_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:52.9804913Z #define __FLT64X_HAS_DENORM__ 1 2025-05-07T20:25:52.9805276Z #define __DEC128_SUBNORMAL_MIN__ 0.000000000000000000000000000000001E-6143DL 2025-05-07T20:25:52.9805669Z #define __FLT_HAS_INFINITY__ 1 2025-05-07T20:25:52.9805961Z #define __GNUC_EXECUTION_CHARSET_NAME "UTF-8" 2025-05-07T20:25:52.9806285Z #define __cpp_unicode_literals 200710L 2025-05-07T20:25:52.9806598Z #define __UINT_FAST16_TYPE__ long unsigned int 2025-05-07T20:25:52.9806925Z #define __DEC64_MAX__ 9.999999999999999E384DD 2025-05-07T20:25:52.9807224Z #define __INT_FAST32_WIDTH__ 64 2025-05-07T20:25:52.9807503Z #define __CHAR16_TYPE__ short unsigned int 2025-05-07T20:25:52.9807808Z #define __PRAGMA_REDEFINE_EXTNAME 1 2025-05-07T20:25:52.9808088Z #define __SIZE_WIDTH__ 64 2025-05-07T20:25:52.9808326Z #define __SEG_FS 1 2025-05-07T20:25:52.9808550Z #define __INT_LEAST16_MAX__ 0x7fff 2025-05-07T20:25:52.9808918Z #define __DEC64_MANT_DIG__ 16 2025-05-07T20:25:52.9809204Z #define __INT64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:52.9809488Z #define __SEG_GS 1 2025-05-07T20:25:52.9809799Z #define __FLT32_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F32 
2025-05-07T20:25:52.9810185Z #define __SIG_ATOMIC_WIDTH__ 32 2025-05-07T20:25:52.9810454Z #define __INT_LEAST64_TYPE__ long int 2025-05-07T20:25:52.9810824Z #define __INT16_TYPE__ short int 2025-05-07T20:25:52.9811110Z #define __INT_LEAST8_TYPE__ signed char 2025-05-07T20:25:52.9811425Z #define __cpp_structured_bindings 201606L 2025-05-07T20:25:52.9811716Z #define __SIZEOF_INT__ 4 2025-05-07T20:25:52.9811964Z #define __DEC32_MAX_EXP__ 97 2025-05-07T20:25:52.9812228Z #define __INT_FAST8_MAX__ 0x7f 2025-05-07T20:25:52.9812576Z #define __FLT128_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:25:52.9813006Z #define __INTPTR_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:52.9813325Z #define __cpp_sized_deallocation 201309L 2025-05-07T20:25:52.9813651Z #define __cpp_guaranteed_copy_elision 201606L 2025-05-07T20:25:52.9813953Z #define linux 1 2025-05-07T20:25:52.9814177Z #define __FLT64_HAS_QUIET_NAN__ 1 2025-05-07T20:25:52.9814448Z #define __FLT32_MIN_10_EXP__ (-37) 2025-05-07T20:25:52.9814779Z #define __EXCEPTIONS 1 2025-05-07T20:25:52.9815026Z #define __PTRDIFF_WIDTH__ 64 2025-05-07T20:25:52.9815284Z #define __LDBL_MANT_DIG__ 64 2025-05-07T20:25:52.9815561Z #define __cpp_range_based_for 201603L 2025-05-07T20:25:52.9815855Z #define __FLT64_HAS_INFINITY__ 1 2025-05-07T20:25:52.9816207Z #define __FLT64X_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:25:52.9816594Z #define __STDCPP_DEFAULT_NEW_ALIGNMENT__ 16 2025-05-07T20:25:52.9816941Z #define __SIG_ATOMIC_MIN__ (-__SIG_ATOMIC_MAX__ - 1) 2025-05-07T20:25:52.9817278Z #define __code_model_small__ 1 2025-05-07T20:25:52.9817545Z #define __GCC_ATOMIC_LONG_LOCK_FREE 2 2025-05-07T20:25:52.9817853Z #define __cpp_nontype_template_args 201411L 2025-05-07T20:25:52.9818162Z #define __DEC32_MANT_DIG__ 7 2025-05-07T20:25:52.9818431Z #define __cpp_return_type_deduction 201304L 2025-05-07T20:25:52.9818728Z #define __k8__ 1 2025-05-07T20:25:52.9818954Z #define __INTPTR_TYPE__ long int 2025-05-07T20:25:52.9819238Z #define __UINT16_TYPE__ short unsigned int 2025-05-07T20:25:52.9819541Z #define __WCHAR_TYPE__ int 2025-05-07T20:25:52.9819786Z #define __pic__ 2 2025-05-07T20:25:52.9820037Z #define __UINTPTR_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:52.9820344Z #define __INT_FAST64_WIDTH__ 64 2025-05-07T20:25:52.9820611Z #define __cpp_decltype 200707L 2025-05-07T20:25:52.9820904Z #define __INT_FAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:52.9821229Z #define __GCC_ATOMIC_TEST_AND_SET_TRUEVAL 1 2025-05-07T20:25:52.9821603Z #define __FLT_NORM_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:25:52.9821966Z #define __FLT64X_MAX_EXP__ 16384 2025-05-07T20:25:52.9822261Z #define __UINT_FAST64_TYPE__ long unsigned int 2025-05-07T20:25:52.9822619Z #define __cpp_inline_variables 201606L 2025-05-07T20:25:52.9822945Z #define __INT_MAX__ 0x7fffffff 2025-05-07T20:25:52.9823191Z #define __linux__ 1 2025-05-07T20:25:52.9823417Z #define __INT64_TYPE__ long int 2025-05-07T20:25:52.9823681Z #define __FLT_MAX_EXP__ 128 2025-05-07T20:25:52.9832919Z #define __ORDER_BIG_ENDIAN__ 4321 2025-05-07T20:25:52.9833262Z #define __DBL_MANT_DIG__ 53 2025-05-07T20:25:52.9833567Z #define __cpp_inheriting_constructors 201511L 2025-05-07T20:25:52.9833922Z #define __SIZEOF_FLOAT128__ 16 2025-05-07T20:25:52.9834225Z #define __INT_LEAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:52.9834555Z #define __DEC64_MIN__ 1E-383DD 2025-05-07T20:25:52.9834830Z #define __WINT_TYPE__ unsigned int 2025-05-07T20:25:52.9835126Z #define __UINT_LEAST32_TYPE__ unsigned 
int 2025-05-07T20:25:52.9835434Z #define __SIZEOF_SHORT__ 2 2025-05-07T20:25:52.9835781Z #define __FLT32_NORM_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:52.9836147Z #define __SSE__ 1 2025-05-07T20:25:52.9836386Z #define __LDBL_MIN_EXP__ (-16381) 2025-05-07T20:25:52.9836923Z #define __FLT64_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:25:52.9837280Z #define __amd64__ 1 2025-05-07T20:25:52.9837513Z #define __WINT_WIDTH__ 32 2025-05-07T20:25:52.9837782Z #define __INT_LEAST64_WIDTH__ 64 2025-05-07T20:25:52.9838067Z #define __LDBL_MAX_EXP__ 16384 2025-05-07T20:25:52.9838333Z #define __FLT32X_MAX_10_EXP__ 308 2025-05-07T20:25:52.9838750Z #define __SIZEOF_INT128__ 16 2025-05-07T20:25:52.9839018Z #define __FLT64X_IS_IEC_60559__ 2 2025-05-07T20:25:52.9839300Z #define __LDBL_MAX_10_EXP__ 4932 2025-05-07T20:25:52.9839578Z #define __ATOMIC_RELAXED 0 2025-05-07T20:25:52.9839935Z #define __DBL_EPSILON__ double(2.22044604925031308084726333618164062e-16L) 2025-05-07T20:25:52.9840412Z #define __FLT128_MIN__ 3.36210314311209350626267781732175260e-4932F128 2025-05-07T20:25:52.9840784Z #define _LP64 1 2025-05-07T20:25:52.9841010Z #define __UINT8_C(c) c 2025-05-07T20:25:52.9841254Z #define __FLT64_MAX_EXP__ 1024 2025-05-07T20:25:52.9841536Z #define __INT_LEAST32_TYPE__ int 2025-05-07T20:25:52.9841819Z #define __SIZEOF_WCHAR_T__ 4 2025-05-07T20:25:52.9842083Z #define __GNUC_PATCHLEVEL__ 0 2025-05-07T20:25:52.9842452Z #define __FLT128_NORM_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:25:52.9842936Z #define __FLT64_NORM_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:25:52.9843324Z #define __FLT128_HAS_QUIET_NAN__ 1 2025-05-07T20:25:52.9843632Z #define __INTMAX_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:52.9843957Z #define __INT_FAST8_TYPE__ signed char 2025-05-07T20:25:52.9844283Z #define __cpp_namespace_attributes 201411L 2025-05-07T20:25:52.9844669Z #define __FLT64X_MIN__ 3.36210314311209350626267781732175260e-4932F64x 2025-05-07T20:25:52.9845061Z #define __STDCPP_THREADS__ 1 2025-05-07T20:25:52.9845339Z #define __GNUC_STDC_INLINE__ 1 2025-05-07T20:25:52.9845606Z #define __FLT64_HAS_DENORM__ 1 2025-05-07T20:25:52.9845965Z #define __FLT32_EPSILON__ 1.19209289550781250000000000000000000e-7F32 2025-05-07T20:25:52.9846359Z #define __DBL_DECIMAL_DIG__ 17 2025-05-07T20:25:52.9846633Z #define __STDC_UTF_32__ 1 2025-05-07T20:25:52.9846884Z #define __INT_FAST8_WIDTH__ 8 2025-05-07T20:25:52.9847140Z #define __FXSR__ 1 2025-05-07T20:25:52.9847455Z #define __FLT32X_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:52.9847917Z #define __DBL_NORM_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:52.9848350Z #define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:52.9848670Z #define __INTMAX_WIDTH__ 64 2025-05-07T20:25:52.9848940Z #define __cpp_runtime_arrays 198712L 2025-05-07T20:25:52.9849254Z #define __UINT64_TYPE__ long unsigned int 2025-05-07T20:25:52.9849565Z #define __UINT32_C(c) c ## U 2025-05-07T20:25:52.9849839Z #define __cpp_alias_templates 200704L 2025-05-07T20:25:52.9850209Z #define __FLT_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F 2025-05-07T20:25:52.9850581Z #define __FLT128_IS_IEC_60559__ 2 2025-05-07T20:25:52.9850852Z #define __INT8_MAX__ 0x7f 2025-05-07T20:25:52.9851106Z #define __LONG_WIDTH__ 64 2025-05-07T20:25:52.9851347Z #define __PIC__ 2 2025-05-07T20:25:52.9851598Z #define __UINT_FAST32_TYPE__ long unsigned int 2025-05-07T20:25:52.9851995Z #define 
__FLT32X_NORM_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:52.9852387Z #define __CHAR32_TYPE__ unsigned int 2025-05-07T20:25:52.9852726Z #define __FLT_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:25:52.9853078Z #define __cpp_constexpr 201603L 2025-05-07T20:25:52.9853345Z #define __SSE2__ 1 2025-05-07T20:25:52.9853588Z #define __cpp_deduction_guides 201703L 2025-05-07T20:25:52.9853878Z #define __INT32_TYPE__ int 2025-05-07T20:25:52.9854134Z #define __SIZEOF_DOUBLE__ 8 2025-05-07T20:25:52.9854402Z #define __cpp_exceptions 199711L 2025-05-07T20:25:52.9854760Z #define __FLT_MIN_10_EXP__ (-37) 2025-05-07T20:25:52.9855102Z #define __FLT64_MIN__ 2.22507385850720138309023271733240406e-308F64 2025-05-07T20:25:52.9855467Z #define __INT_LEAST32_WIDTH__ 32 2025-05-07T20:25:52.9855836Z #define __INTMAX_TYPE__ long int 2025-05-07T20:25:52.9857564Z #define __DEC128_MAX_EXP__ 6145 2025-05-07T20:25:52.9857837Z #define __FLT32X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:52.9858116Z #define __ATOMIC_CONSUME 1 2025-05-07T20:25:52.9858359Z #define __GNUC_MINOR__ 4 2025-05-07T20:25:52.9858613Z #define __GLIBCXX_TYPE_INT_N_0 __int128 2025-05-07T20:25:52.9858908Z #define __INT_FAST16_WIDTH__ 64 2025-05-07T20:25:52.9859277Z #define __UINTMAX_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:52.9859574Z #define __PIE__ 2 2025-05-07T20:25:52.9859895Z #define __FLT32X_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F32x 2025-05-07T20:25:52.9860317Z #define __cpp_template_template_args 201611L 2025-05-07T20:25:52.9860636Z #define __DBL_MAX_10_EXP__ 308 2025-05-07T20:25:52.9860985Z #define __LDBL_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951L 2025-05-07T20:25:52.9861354Z #define __INT16_C(c) c 2025-05-07T20:25:52.9861574Z #define __STDC__ 1 2025-05-07T20:25:52.9861795Z #define __FLT32X_DIG__ 15 2025-05-07T20:25:52.9862055Z #define __PTRDIFF_TYPE__ long int 2025-05-07T20:25:52.9862324Z #define __ATOMIC_SEQ_CST 5 2025-05-07T20:25:52.9862583Z #define __FLT32X_MIN_10_EXP__ (-307) 2025-05-07T20:25:52.9862882Z #define __UINTPTR_TYPE__ long unsigned int 2025-05-07T20:25:52.9863225Z #define __DEC64_SUBNORMAL_MIN__ 0.000000000000001E-383DD 2025-05-07T20:25:52.9863567Z #define __DEC128_MANT_DIG__ 34 2025-05-07T20:25:52.9863838Z #define __LDBL_MIN_10_EXP__ (-4931) 2025-05-07T20:25:52.9864122Z #define __cpp_generic_lambdas 201304L 2025-05-07T20:25:52.9864406Z #define __SSE_MATH__ 1 2025-05-07T20:25:52.9864655Z #define __SIZEOF_LONG_LONG__ 8 2025-05-07T20:25:52.9864934Z #define __cpp_user_defined_literals 200809L 2025-05-07T20:25:52.9865241Z #define __FLT128_DECIMAL_DIG__ 36 2025-05-07T20:25:52.9865528Z #define __GCC_ATOMIC_LLONG_LOCK_FREE 2 2025-05-07T20:25:52.9865819Z #define __FLT32_HAS_QUIET_NAN__ 1 2025-05-07T20:25:52.9866089Z #define __FLT_DECIMAL_DIG__ 9 2025-05-07T20:25:52.9866393Z #define __UINT_FAST16_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:52.9866792Z #define __LDBL_NORM_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:52.9867173Z #define __GCC_ATOMIC_SHORT_LOCK_FREE 2 2025-05-07T20:25:52.9867480Z #define __UINT_FAST8_TYPE__ unsigned char 2025-05-07T20:25:52.9867775Z #define _GNU_SOURCE 1 2025-05-07T20:25:52.9868015Z #define __cpp_init_captures 201304L 2025-05-07T20:25:52.9868301Z #define __ATOMIC_ACQ_REL 4 2025-05-07T20:25:52.9868550Z #define __ATOMIC_RELEASE 3 2025-05-07T20:25:52.9868710Z 2025-05-07T20:25:53.0402414Z 2025-05-07T20:25:53.0403057Z + conda run -n build_binary c++ --version 2025-05-07T20:25:53.0403654Z 2025-05-07T20:25:54.9421928Z c++ 
(conda-forge gcc 11.4.0-13) 11.4.0 2025-05-07T20:25:54.9422483Z Copyright (C) 2021 Free Software Foundation, Inc. 2025-05-07T20:25:54.9422962Z This is free software; see the source for copying conditions. There is NO 2025-05-07T20:25:54.9423529Z warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 2025-05-07T20:25:54.9423903Z 2025-05-07T20:25:54.9423909Z 2025-05-07T20:25:55.0149351Z 2025-05-07T20:25:55.0150197Z [INFO] Printing the default version of the C standard used by the compiler ... 2025-05-07T20:25:55.0150870Z + conda run -n build_binary cc -dM -E - < /dev/null | grep __STDC_VERSION__ 2025-05-07T20:25:55.0151187Z 2025-05-07T20:25:56.9939562Z #define __STDC_VERSION__ 201710L 2025-05-07T20:25:56.9941903Z 2025-05-07T20:25:56.9942457Z [INFO] Printing the default version of the C++ standard used by the compiler ... 2025-05-07T20:25:56.9943271Z + conda run -n build_binary c++ -dM -E -x c++ - < /dev/null | grep __cplusplus 2025-05-07T20:25:56.9943730Z 2025-05-07T20:25:58.9855703Z #define __cplusplus 201703L 2025-05-07T20:25:58.9858137Z 2025-05-07T20:25:58.9859290Z [INSTALL] Successfully installed C/C++ compilers 2025-05-07T20:25:58.9896569Z ##[group]Run . $PRELUDE; install_cuda $BUILD_ENV 12.6.3 2025-05-07T20:25:58.9897025Z . $PRELUDE; install_cuda $BUILD_ENV 12.6.3 2025-05-07T20:25:58.9909286Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:25:58.9909660Z env: 2025-05-07T20:25:58.9909907Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:25:58.9910221Z BUILD_ENV: build_binary 2025-05-07T20:25:58.9910481Z BUILD_TARGET: genai 2025-05-07T20:25:58.9910740Z BUILD_VARIANT: cuda 2025-05-07T20:25:58.9910989Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:25:58.9911462Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:25:58.9911784Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:25:58.9912143Z ##[endgroup] 2025-05-07T20:25:59.3334929Z ################################################################################ 2025-05-07T20:25:59.3335313Z # Install CUDA 2025-05-07T20:25:59.3335527Z # 2025-05-07T20:25:59.3351763Z # [2025-05-07T20:25:59.334Z] + install_cuda build_binary 12.6.3 2025-05-07T20:25:59.3352352Z ################################################################################ 2025-05-07T20:25:59.3352689Z 2025-05-07T20:25:59.3368973Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:25:59.4230782Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:25:59.4231348Z [SETUP] Cleaning up Conda packages ... 2025-05-07T20:25:59.4236414Z + conda clean --packages --tarball -y 2025-05-07T20:25:59.4236653Z 2025-05-07T20:26:00.3039237Z Will remove 40 (182.7 MB) tarball(s). 2025-05-07T20:26:00.3039625Z Will remove 7 (108.6 MB) package(s). 2025-05-07T20:26:00.3757723Z 2025-05-07T20:26:00.3769207Z + conda clean --all -y 2025-05-07T20:26:00.3769382Z 2025-05-07T20:26:01.0520780Z There are no unused tarball(s) to remove. 2025-05-07T20:26:01.0521655Z Will remove 1 index cache(s). 2025-05-07T20:26:01.0522347Z There are no unused package(s) to remove. 2025-05-07T20:26:01.0523059Z There are no tempfile(s) to remove. 2025-05-07T20:26:01.0523647Z There are no logfile(s) to remove. 2025-05-07T20:26:01.1190911Z 2025-05-07T20:26:01.1205868Z [INSTALL] Installing CUDA 12.6.3 ... 
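[NOTE] The [EXEC] [ATTEMPT n/3] markers in this job come from a retry helper in setup_env.bash. A minimal sketch of that pattern, assuming a fixed three-attempt loop (the function name exec_with_retries, the sleep between attempts, and the failure message are illustrative, not the script's actual code):

    # Run a command, retrying up to 3 times before giving up.
    exec_with_retries () {
      local max_attempts=3 i
      for ((i = 0; i < max_attempts; i++)); do
        echo "[EXEC] [ATTEMPT ${i}/${max_attempts}] + $*"
        "$@" && return 0
        sleep 2  # assumed pause between attempts
      done
      echo "[EXEC] Command failed after ${max_attempts} attempts: $*" >&2
      return 1
    }

    # Usage mirroring the install below; --force-reinstall plus --override-channels keeps
    # the conda-forge cuda=12.6.3 metapackage authoritative for the build_binary env.
    exec_with_retries conda install --force-reinstall -n build_binary \
      -c conda-forge --override-channels -y cuda=12.6.3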
2025-05-07T20:26:01.1231850Z [EXEC] [ATTEMPT 0/3] + conda install --force-reinstall -n build_binary -c conda-forge --override-channels -y cuda=12.6.3 2025-05-07T20:26:02.0325968Z Channels: 2025-05-07T20:26:02.0326222Z - conda-forge 2025-05-07T20:26:02.0326475Z Platform: linux-64 2025-05-07T20:26:12.9108405Z Collecting package metadata (repodata.json): done 2025-05-07T20:26:14.0257885Z Solving environment: done 2025-05-07T20:26:14.1017418Z 2025-05-07T20:26:14.1017721Z ## Package Plan ## 2025-05-07T20:26:14.1017886Z 2025-05-07T20:26:14.1018233Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:26:14.1018673Z 2025-05-07T20:26:14.1018827Z added / updated specs: 2025-05-07T20:26:14.1019131Z - cuda=12.6.3 2025-05-07T20:26:14.1019265Z 2025-05-07T20:26:14.1019300Z 2025-05-07T20:26:14.1019420Z The following packages will be downloaded: 2025-05-07T20:26:14.1019640Z 2025-05-07T20:26:14.1019752Z package | build 2025-05-07T20:26:14.1020078Z ---------------------------|----------------- 2025-05-07T20:26:14.1020454Z alsa-lib-1.2.14 | hb9d3cd8_0 553 KB conda-forge 2025-05-07T20:26:14.1020864Z attr-2.5.1 | h166bdaf_1 69 KB conda-forge 2025-05-07T20:26:14.1021436Z binutils-2.40 | h4852527_7 31 KB conda-forge 2025-05-07T20:26:14.1022104Z c-compiler-1.5.2 | h0b41bf4_0 6 KB conda-forge 2025-05-07T20:26:14.1022677Z cuda-12.6.3 | ha804496_0 26 KB conda-forge 2025-05-07T20:26:14.1023200Z cuda-cccl_linux-64-12.6.77 | ha770c72_0 1.0 MB conda-forge 2025-05-07T20:26:14.1024420Z cuda-command-line-tools-12.6.3| ha770c72_0 20 KB conda-forge 2025-05-07T20:26:14.1024950Z cuda-compiler-12.6.3 | hbad6d8a_0 20 KB conda-forge 2025-05-07T20:26:14.1025742Z cuda-crt-dev_linux-64-12.6.85| ha770c72_0 87 KB conda-forge 2025-05-07T20:26:14.1026239Z cuda-crt-tools-12.6.85 | ha770c72_0 26 KB conda-forge 2025-05-07T20:26:14.1026712Z cuda-cudart-12.6.77 | h5888daf_0 22 KB conda-forge 2025-05-07T20:26:14.1027191Z cuda-cudart-dev-12.6.77 | h5888daf_0 22 KB conda-forge 2025-05-07T20:26:14.1027699Z cuda-cudart-dev_linux-64-12.6.77| h3f2d84a_0 357 KB conda-forge 2025-05-07T20:26:14.1028410Z cuda-cudart-static-12.6.77 | h5888daf_0 22 KB conda-forge 2025-05-07T20:26:14.1028953Z cuda-cudart-static_linux-64-12.6.77| h3f2d84a_0 744 KB conda-forge 2025-05-07T20:26:14.1029482Z cuda-cudart_linux-64-12.6.77| h3f2d84a_0 184 KB conda-forge 2025-05-07T20:26:14.1029983Z cuda-cuobjdump-12.6.77 | hbd13f7d_1 241 KB conda-forge 2025-05-07T20:26:14.1030453Z cuda-cupti-12.6.80 | hbd13f7d_0 1.9 MB conda-forge 2025-05-07T20:26:14.1030922Z cuda-cupti-dev-12.6.80 | h5888daf_0 3.4 MB conda-forge 2025-05-07T20:26:14.1031395Z cuda-cuxxfilt-12.6.77 | hbd13f7d_1 211 KB conda-forge 2025-05-07T20:26:14.1031883Z cuda-driver-dev-12.6.77 | h5888daf_0 22 KB conda-forge 2025-05-07T20:26:14.1032399Z cuda-driver-dev_linux-64-12.6.77| h3f2d84a_0 35 KB conda-forge 2025-05-07T20:26:14.1032888Z cuda-gdb-12.6.77 | h50b4baa_1 370 KB conda-forge 2025-05-07T20:26:14.1033349Z cuda-libraries-12.6.3 | ha770c72_0 20 KB conda-forge 2025-05-07T20:26:14.1033846Z cuda-libraries-dev-12.6.3 | ha770c72_0 20 KB conda-forge 2025-05-07T20:26:14.1034335Z cuda-nsight-12.6.77 | h7938cbb_0 113.2 MB conda-forge 2025-05-07T20:26:14.1034784Z cuda-nvcc-12.6.85 | hcdd1206_0 23 KB conda-forge 2025-05-07T20:26:14.1035260Z cuda-nvcc-dev_linux-64-12.6.85| he91c749_0 10.8 MB conda-forge 2025-05-07T20:26:14.1035752Z cuda-nvcc-impl-12.6.85 | h85509e4_0 25 KB conda-forge 2025-05-07T20:26:14.1036227Z cuda-nvcc-tools-12.6.85 | he02047a_0 
23.0 MB conda-forge 2025-05-07T20:26:14.1036718Z cuda-nvcc_linux-64-12.6.85 | h04802cd_0 25 KB conda-forge 2025-05-07T20:26:14.1037205Z cuda-nvdisasm-12.6.77 | hbd13f7d_1 47.6 MB conda-forge 2025-05-07T20:26:14.1037679Z cuda-nvml-dev-12.6.77 | hbd13f7d_1 159 KB conda-forge 2025-05-07T20:26:14.1038138Z cuda-nvprof-12.6.80 | hbd13f7d_0 2.6 MB conda-forge 2025-05-07T20:26:14.1038603Z cuda-nvprune-12.6.77 | hbd13f7d_1 66 KB conda-forge 2025-05-07T20:26:14.1039076Z cuda-nvrtc-12.6.85 | hbd13f7d_0 17.3 MB conda-forge 2025-05-07T20:26:14.1039546Z cuda-nvrtc-dev-12.6.85 | h5888daf_0 31 KB conda-forge 2025-05-07T20:26:14.1039999Z cuda-nvtx-12.6.77 | hbd13f7d_0 31 KB conda-forge 2025-05-07T20:26:14.1040476Z cuda-nvvm-dev_linux-64-12.6.85| ha770c72_0 25 KB conda-forge 2025-05-07T20:26:14.1040971Z cuda-nvvm-impl-12.6.85 | he02047a_0 7.7 MB conda-forge 2025-05-07T20:26:14.1041445Z cuda-nvvm-tools-12.6.85 | he02047a_0 10.4 MB conda-forge 2025-05-07T20:26:14.1041917Z cuda-nvvp-12.6.80 | hbd13f7d_1 109.3 MB conda-forge 2025-05-07T20:26:14.1042372Z cuda-opencl-12.6.77 | hbd13f7d_0 29 KB conda-forge 2025-05-07T20:26:14.1042850Z cuda-opencl-dev-12.6.77 | h5888daf_0 93 KB conda-forge 2025-05-07T20:26:14.1043509Z cuda-profiler-api-12.6.77 | h7938cbb_0 22 KB conda-forge 2025-05-07T20:26:14.1043998Z cuda-runtime-12.6.3 | ha804496_0 19 KB conda-forge 2025-05-07T20:26:14.1044489Z cuda-sanitizer-api-12.6.77 | hbd13f7d_1 8.9 MB conda-forge 2025-05-07T20:26:14.1044969Z cuda-toolkit-12.6.3 | ha804496_0 19 KB conda-forge 2025-05-07T20:26:14.1045424Z cuda-tools-12.6.3 | ha770c72_0 19 KB conda-forge 2025-05-07T20:26:14.1045877Z cuda-version-12.6 | h7480c83_3 20 KB conda-forge 2025-05-07T20:26:14.1046356Z cuda-visual-tools-12.6.3 | ha770c72_0 19 KB conda-forge 2025-05-07T20:26:14.1046913Z cxx-compiler-1.5.2 | hf52228f_0 6 KB conda-forge 2025-05-07T20:26:14.1047344Z dbus-1.13.6 | h5008d03_3 604 KB conda-forge 2025-05-07T20:26:14.1047874Z font-ttf-dejavu-sans-mono-2.37| hab24e00_0 388 KB conda-forge 2025-05-07T20:26:14.1048412Z font-ttf-inconsolata-3.000 | h77eed37_0 94 KB conda-forge 2025-05-07T20:26:14.1048952Z font-ttf-source-code-pro-2.038| h77eed37_0 684 KB conda-forge 2025-05-07T20:26:14.1049507Z font-ttf-ubuntu-0.83 | h77eed37_3 1.5 MB conda-forge 2025-05-07T20:26:14.1049972Z fontconfig-2.15.0 | h7e30c49_1 259 KB conda-forge 2025-05-07T20:26:14.1050454Z fonts-conda-ecosystem-1 | 0 4 KB conda-forge 2025-05-07T20:26:14.1050949Z fonts-conda-forge-1 | 0 4 KB conda-forge 2025-05-07T20:26:14.1051406Z freetype-2.13.3 | ha770c72_1 168 KB conda-forge 2025-05-07T20:26:14.1051826Z gcc-11.4.0 | h602e360_13 49 KB conda-forge 2025-05-07T20:26:14.1052243Z gds-tools-1.11.1.6 | h5888daf_4 37.8 MB conda-forge 2025-05-07T20:26:14.1052665Z gmp-6.3.0 | hac33072_2 449 KB conda-forge 2025-05-07T20:26:14.1053046Z gxx-11.4.0 | h602e360_13 49 KB conda-forge 2025-05-07T20:26:14.1053451Z keyutils-1.6.1 | h166bdaf_0 115 KB conda-forge 2025-05-07T20:26:14.1053857Z krb5-1.21.3 | h659f571_0 1.3 MB conda-forge 2025-05-07T20:26:14.1054251Z libcap-2.71 | h39aace5_0 100 KB conda-forge 2025-05-07T20:26:14.1054807Z libcublas-12.6.4.1 | h5888daf_1 256.2 MB conda-forge 2025-05-07T20:26:14.1055264Z libcublas-dev-12.6.4.1 | h5888daf_1 88 KB conda-forge 2025-05-07T20:26:14.1055725Z libcufft-11.3.0.4 | hbd13f7d_0 156.2 MB conda-forge 2025-05-07T20:26:14.1056170Z libcufft-dev-11.3.0.4 | h5888daf_0 33 KB conda-forge 2025-05-07T20:26:14.1056622Z libcufile-1.11.1.6 | h12f29b5_4 900 KB conda-forge 2025-05-07T20:26:14.1057082Z libcufile-dev-1.11.1.6 | h5888daf_4 35 
KB conda-forge 2025-05-07T20:26:14.1057539Z libcurand-10.3.7.77 | hbd13f7d_0 39.9 MB conda-forge 2025-05-07T20:26:14.1058050Z libcurand-dev-10.3.7.77 | h5888daf_0 262 KB conda-forge 2025-05-07T20:26:14.1058514Z libcusolver-11.7.1.2 | h5888daf_1 95.8 MB conda-forge 2025-05-07T20:26:14.1058987Z libcusolver-dev-11.7.1.2 | h5888daf_1 59 KB conda-forge 2025-05-07T20:26:14.1059456Z libcusparse-12.5.4.2 | hbd13f7d_0 118.6 MB conda-forge 2025-05-07T20:26:14.1059937Z libcusparse-dev-12.5.4.2 | h5888daf_0 51 KB conda-forge 2025-05-07T20:26:14.1060416Z libedit-3.1.20191231 | he28a2e2_2 121 KB conda-forge 2025-05-07T20:26:14.1060870Z libfreetype-2.13.3 | ha770c72_1 8 KB conda-forge 2025-05-07T20:26:14.1061330Z libfreetype6-2.13.3 | h48d6fc4_1 371 KB conda-forge 2025-05-07T20:26:14.1061900Z libgcrypt-lib-1.11.0 | hb9d3cd8_2 572 KB conda-forge 2025-05-07T20:26:14.1062350Z libglib-2.84.0 | h2ff4ddf_0 3.8 MB conda-forge 2025-05-07T20:26:14.1062783Z libgpg-error-1.55 | h3f2d84a_0 305 KB conda-forge 2025-05-07T20:26:14.1063228Z libiconv-1.18 | h4ce23a2_1 696 KB conda-forge 2025-05-07T20:26:14.1063650Z libnl-3.11.0 | hb9d3cd8_0 724 KB conda-forge 2025-05-07T20:26:14.1064068Z libnpp-12.3.1.54 | h5888daf_0 93.4 MB conda-forge 2025-05-07T20:26:14.1064585Z libnpp-dev-12.3.1.54 | h5888daf_0 441 KB conda-forge 2025-05-07T20:26:14.1065024Z libnuma-2.0.18 | h4ab18f5_2 42 KB conda-forge 2025-05-07T20:26:14.1065464Z libnvfatbin-12.6.77 | hbd13f7d_0 783 KB conda-forge 2025-05-07T20:26:14.1065937Z libnvfatbin-dev-12.6.77 | h5888daf_0 26 KB conda-forge 2025-05-07T20:26:14.1066411Z libnvjitlink-12.6.85 | hbd13f7d_0 14.9 MB conda-forge 2025-05-07T20:26:14.1066888Z libnvjitlink-dev-12.6.85 | h5888daf_0 25 KB conda-forge 2025-05-07T20:26:14.1067358Z libnvjpeg-12.3.3.54 | h5888daf_0 2.4 MB conda-forge 2025-05-07T20:26:14.1067865Z libnvjpeg-dev-12.3.3.54 | ha770c72_0 31 KB conda-forge 2025-05-07T20:26:14.1068308Z libpng-1.6.47 | h943b412_0 282 KB conda-forge 2025-05-07T20:26:14.1068740Z libsqlite-3.49.2 | hee588c1_0 895 KB conda-forge 2025-05-07T20:26:14.1069196Z libsystemd0-256.9 | h2774228_0 401 KB conda-forge 2025-05-07T20:26:14.1069630Z libudev1-257.4 | h9a4d06a_0 140 KB conda-forge 2025-05-07T20:26:14.1070053Z libxcb-1.17.0 | h8a09558_0 387 KB conda-forge 2025-05-07T20:26:14.1070496Z libxkbcommon-1.8.0 | hc4a0caf_0 627 KB conda-forge 2025-05-07T20:26:14.1079687Z libxkbfile-1.1.0 | h166bdaf_1 111 KB conda-forge 2025-05-07T20:26:14.1080176Z libxml2-2.13.5 | h064dc61_0 673 KB conda-forge 2025-05-07T20:26:14.1080620Z libzlib-1.3.1 | hb9d3cd8_2 60 KB conda-forge 2025-05-07T20:26:14.1081039Z lz4-c-1.9.4 | hcb278e6_0 140 KB conda-forge 2025-05-07T20:26:14.1081508Z nsight-compute-2024.3.2.3 | hb5ebaad_0 443.1 MB conda-forge 2025-05-07T20:26:14.1081978Z nspr-4.36 | h5888daf_0 225 KB conda-forge 2025-05-07T20:26:14.1082388Z nss-3.111 | h159eef7_0 1.9 MB conda-forge 2025-05-07T20:26:14.1082803Z ocl-icd-2.3.3 | hb9d3cd8_0 104 KB conda-forge 2025-05-07T20:26:14.1083269Z opencl-headers-2024.10.24 | h5888daf_0 53 KB conda-forge 2025-05-07T20:26:14.1083731Z pcre2-10.44 | hc749103_2 934 KB conda-forge 2025-05-07T20:26:14.1084179Z pthread-stubs-0.4 | hb9d3cd8_1002 8 KB conda-forge 2025-05-07T20:26:14.1084623Z rdma-core-55.0 | h5888daf_0 1.2 MB conda-forge 2025-05-07T20:26:14.1085037Z sqlite-3.32.3 | hcee41ef_1 1.4 MB conda-forge 2025-05-07T20:26:14.1085451Z tk-8.6.13 |noxft_h4845f30_101 3.2 MB conda-forge 2025-05-07T20:26:14.1085864Z wayland-1.23.1 | h3e06ad9_0 314 KB conda-forge 2025-05-07T20:26:14.1086283Z xcb-util-0.4.1 | hb711507_2 19 
KB conda-forge 2025-05-07T20:26:14.1086739Z xcb-util-cursor-0.1.5 | hb9d3cd8_0 20 KB conda-forge 2025-05-07T20:26:14.1087220Z xcb-util-image-0.4.0 | hb711507_2 24 KB conda-forge 2025-05-07T20:26:14.1087850Z xcb-util-keysyms-0.4.1 | hb711507_0 14 KB conda-forge 2025-05-07T20:26:14.1088364Z xcb-util-renderutil-0.3.10 | hb711507_0 17 KB conda-forge 2025-05-07T20:26:14.1088844Z xcb-util-wm-0.4.2 | hb711507_0 50 KB conda-forge 2025-05-07T20:26:14.1089314Z xkeyboard-config-2.44 | hb9d3cd8_0 384 KB conda-forge 2025-05-07T20:26:14.1089781Z xorg-libice-1.1.2 | hb9d3cd8_0 57 KB conda-forge 2025-05-07T20:26:14.1090227Z xorg-libsm-1.2.6 | he73a12e_0 27 KB conda-forge 2025-05-07T20:26:14.1090785Z xorg-libx11-1.8.12 | h4f16b4b_0 816 KB conda-forge 2025-05-07T20:26:14.1091227Z xorg-libxau-1.0.12 | hb9d3cd8_0 14 KB conda-forge 2025-05-07T20:26:14.1091709Z xorg-libxcomposite-0.4.6 | hb9d3cd8_2 13 KB conda-forge 2025-05-07T20:26:14.1092210Z xorg-libxdamage-1.1.6 | hb9d3cd8_0 13 KB conda-forge 2025-05-07T20:26:14.1092688Z xorg-libxdmcp-1.1.5 | hb9d3cd8_0 19 KB conda-forge 2025-05-07T20:26:14.1093140Z xorg-libxext-1.3.6 | hb9d3cd8_0 49 KB conda-forge 2025-05-07T20:26:14.1093603Z xorg-libxfixes-6.0.1 | hb9d3cd8_0 19 KB conda-forge 2025-05-07T20:26:14.1094056Z xorg-libxi-1.8.2 | hb9d3cd8_0 46 KB conda-forge 2025-05-07T20:26:14.1094501Z xorg-libxrandr-1.5.4 | hb9d3cd8_0 29 KB conda-forge 2025-05-07T20:26:14.1095089Z xorg-libxrender-0.9.12 | hb9d3cd8_0 32 KB conda-forge 2025-05-07T20:26:14.1095569Z xorg-libxtst-1.2.5 | hb9d3cd8_3 32 KB conda-forge 2025-05-07T20:26:14.1095994Z zlib-1.3.1 | hb9d3cd8_2 90 KB conda-forge 2025-05-07T20:26:14.1096377Z zstd-1.5.7 | hb8e6e7a_2 554 KB conda-forge 2025-05-07T20:26:14.1096770Z ------------------------------------------------------------ 2025-05-07T20:26:14.1097121Z Total: 1.61 GB 2025-05-07T20:26:14.1097339Z 2025-05-07T20:26:14.1097465Z The following NEW packages will be INSTALLED: 2025-05-07T20:26:14.1097722Z 2025-05-07T20:26:14.1097960Z alsa-lib conda-forge/linux-64::alsa-lib-1.2.14-hb9d3cd8_0 2025-05-07T20:26:14.1098387Z attr conda-forge/linux-64::attr-2.5.1-h166bdaf_1 2025-05-07T20:26:14.1098812Z binutils conda-forge/linux-64::binutils-2.40-h4852527_7 2025-05-07T20:26:14.1099276Z c-compiler conda-forge/linux-64::c-compiler-1.5.2-h0b41bf4_0 2025-05-07T20:26:14.1099712Z cuda conda-forge/noarch::cuda-12.6.3-ha804496_0 2025-05-07T20:26:14.1100191Z cuda-cccl_linux-64 conda-forge/noarch::cuda-cccl_linux-64-12.6.77-ha770c72_0 2025-05-07T20:26:14.1100802Z cuda-command-line~ conda-forge/linux-64::cuda-command-line-tools-12.6.3-ha770c72_0 2025-05-07T20:26:14.1101391Z cuda-compiler conda-forge/noarch::cuda-compiler-12.6.3-hbad6d8a_0 2025-05-07T20:26:14.1101948Z cuda-crt-dev_linu~ conda-forge/noarch::cuda-crt-dev_linux-64-12.6.85-ha770c72_0 2025-05-07T20:26:14.1102522Z cuda-crt-tools conda-forge/linux-64::cuda-crt-tools-12.6.85-ha770c72_0 2025-05-07T20:26:14.1103227Z cuda-cudart conda-forge/linux-64::cuda-cudart-12.6.77-h5888daf_0 2025-05-07T20:26:14.1103757Z cuda-cudart-dev conda-forge/linux-64::cuda-cudart-dev-12.6.77-h5888daf_0 2025-05-07T20:26:14.1104355Z cuda-cudart-dev_l~ conda-forge/noarch::cuda-cudart-dev_linux-64-12.6.77-h3f2d84a_0 2025-05-07T20:26:14.1104985Z cuda-cudart-static conda-forge/linux-64::cuda-cudart-static-12.6.77-h5888daf_0 2025-05-07T20:26:14.1105624Z cuda-cudart-stati~ conda-forge/noarch::cuda-cudart-static_linux-64-12.6.77-h3f2d84a_0 2025-05-07T20:26:14.1106240Z cuda-cudart_linux~ conda-forge/noarch::cuda-cudart_linux-64-12.6.77-h3f2d84a_0 
2025-05-07T20:26:14.1106914Z cuda-cuobjdump conda-forge/linux-64::cuda-cuobjdump-12.6.77-hbd13f7d_1
2025-05-07T20:26:14.1107455Z cuda-cupti conda-forge/linux-64::cuda-cupti-12.6.80-hbd13f7d_0
2025-05-07T20:26:14.1108035Z cuda-cupti-dev conda-forge/linux-64::cuda-cupti-dev-12.6.80-h5888daf_0
2025-05-07T20:26:14.1108578Z cuda-cuxxfilt conda-forge/linux-64::cuda-cuxxfilt-12.6.77-hbd13f7d_1
2025-05-07T20:26:14.1109129Z cuda-driver-dev conda-forge/linux-64::cuda-driver-dev-12.6.77-h5888daf_0
2025-05-07T20:26:14.1109798Z cuda-driver-dev_l~ conda-forge/noarch::cuda-driver-dev_linux-64-12.6.77-h3f2d84a_0
2025-05-07T20:26:14.1110672Z cuda-gdb conda-forge/linux-64::cuda-gdb-12.6.77-h50b4baa_1
2025-05-07T20:26:14.1111353Z cuda-libraries conda-forge/linux-64::cuda-libraries-12.6.3-ha770c72_0
2025-05-07T20:26:14.1112009Z cuda-libraries-dev conda-forge/linux-64::cuda-libraries-dev-12.6.3-ha770c72_0
2025-05-07T20:26:14.1112586Z cuda-nsight conda-forge/linux-64::cuda-nsight-12.6.77-h7938cbb_0
2025-05-07T20:26:14.1113091Z cuda-nvcc conda-forge/linux-64::cuda-nvcc-12.6.85-hcdd1206_0
2025-05-07T20:26:14.1113623Z cuda-nvcc-dev_lin~ conda-forge/noarch::cuda-nvcc-dev_linux-64-12.6.85-he91c749_0
2025-05-07T20:26:14.1114375Z cuda-nvcc-impl conda-forge/linux-64::cuda-nvcc-impl-12.6.85-h85509e4_0
2025-05-07T20:26:14.1115106Z cuda-nvcc-tools conda-forge/linux-64::cuda-nvcc-tools-12.6.85-he02047a_0
2025-05-07T20:26:14.1115675Z cuda-nvcc_linux-64 conda-forge/linux-64::cuda-nvcc_linux-64-12.6.85-h04802cd_0
2025-05-07T20:26:14.1116240Z cuda-nvdisasm conda-forge/linux-64::cuda-nvdisasm-12.6.77-hbd13f7d_1
2025-05-07T20:26:14.1116786Z cuda-nvml-dev conda-forge/linux-64::cuda-nvml-dev-12.6.77-hbd13f7d_1
2025-05-07T20:26:14.1117310Z cuda-nvprof conda-forge/linux-64::cuda-nvprof-12.6.80-hbd13f7d_0
2025-05-07T20:26:14.1117829Z cuda-nvprune conda-forge/linux-64::cuda-nvprune-12.6.77-hbd13f7d_1
2025-05-07T20:26:14.1118343Z cuda-nvrtc conda-forge/linux-64::cuda-nvrtc-12.6.85-hbd13f7d_0
2025-05-07T20:26:14.1119022Z cuda-nvrtc-dev conda-forge/linux-64::cuda-nvrtc-dev-12.6.85-h5888daf_0
2025-05-07T20:26:14.1119712Z cuda-nvtx conda-forge/linux-64::cuda-nvtx-12.6.77-hbd13f7d_0
2025-05-07T20:26:14.1120240Z cuda-nvvm-dev_lin~ conda-forge/noarch::cuda-nvvm-dev_linux-64-12.6.85-ha770c72_0
2025-05-07T20:26:14.1120822Z cuda-nvvm-impl conda-forge/linux-64::cuda-nvvm-impl-12.6.85-he02047a_0
2025-05-07T20:26:14.1121397Z cuda-nvvm-tools conda-forge/linux-64::cuda-nvvm-tools-12.6.85-he02047a_0
2025-05-07T20:26:14.1122122Z cuda-nvvp conda-forge/linux-64::cuda-nvvp-12.6.80-hbd13f7d_1
2025-05-07T20:26:14.1122730Z cuda-opencl conda-forge/linux-64::cuda-opencl-12.6.77-hbd13f7d_0
2025-05-07T20:26:14.1123271Z cuda-opencl-dev conda-forge/linux-64::cuda-opencl-dev-12.6.77-h5888daf_0
2025-05-07T20:26:14.1123867Z cuda-profiler-api conda-forge/linux-64::cuda-profiler-api-12.6.77-h7938cbb_0
2025-05-07T20:26:14.1124430Z cuda-runtime conda-forge/noarch::cuda-runtime-12.6.3-ha804496_0
2025-05-07T20:26:14.1124992Z cuda-sanitizer-api conda-forge/linux-64::cuda-sanitizer-api-12.6.77-hbd13f7d_1
2025-05-07T20:26:14.1125764Z cuda-toolkit conda-forge/noarch::cuda-toolkit-12.6.3-ha804496_0
2025-05-07T20:26:14.1126259Z cuda-tools conda-forge/linux-64::cuda-tools-12.6.3-ha770c72_0
2025-05-07T20:26:14.1126747Z cuda-version conda-forge/noarch::cuda-version-12.6-h7480c83_3
2025-05-07T20:26:14.1127286Z cuda-visual-tools conda-forge/linux-64::cuda-visual-tools-12.6.3-ha770c72_0
2025-05-07T20:26:14.1127852Z cxx-compiler conda-forge/linux-64::cxx-compiler-1.5.2-hf52228f_0
2025-05-07T20:26:14.1128315Z dbus conda-forge/linux-64::dbus-1.13.6-h5008d03_3
2025-05-07T20:26:14.1128839Z font-ttf-dejavu-s~ conda-forge/noarch::font-ttf-dejavu-sans-mono-2.37-hab24e00_0
2025-05-07T20:26:14.1129471Z font-ttf-inconsol~ conda-forge/noarch::font-ttf-inconsolata-3.000-h77eed37_0
2025-05-07T20:26:14.1130266Z font-ttf-source-c~ conda-forge/noarch::font-ttf-source-code-pro-2.038-h77eed37_0
2025-05-07T20:26:14.1130861Z font-ttf-ubuntu conda-forge/noarch::font-ttf-ubuntu-0.83-h77eed37_3
2025-05-07T20:26:14.1131370Z fontconfig conda-forge/linux-64::fontconfig-2.15.0-h7e30c49_1
2025-05-07T20:26:14.1131878Z fonts-conda-ecosy~ conda-forge/noarch::fonts-conda-ecosystem-1-0
2025-05-07T20:26:14.1132382Z fonts-conda-forge conda-forge/noarch::fonts-conda-forge-1-0
2025-05-07T20:26:14.1132855Z freetype conda-forge/linux-64::freetype-2.13.3-ha770c72_1
2025-05-07T20:26:14.1133440Z gcc conda-forge/linux-64::gcc-11.4.0-h602e360_13
2025-05-07T20:26:14.1134358Z gds-tools conda-forge/linux-64::gds-tools-1.11.1.6-h5888daf_4
2025-05-07T20:26:14.1135034Z gmp conda-forge/linux-64::gmp-6.3.0-hac33072_2
2025-05-07T20:26:14.1135423Z gxx conda-forge/linux-64::gxx-11.4.0-h602e360_13
2025-05-07T20:26:14.1135841Z keyutils conda-forge/linux-64::keyutils-1.6.1-h166bdaf_0
2025-05-07T20:26:14.1136265Z krb5 conda-forge/linux-64::krb5-1.21.3-h659f571_0
2025-05-07T20:26:14.1136676Z libcap conda-forge/linux-64::libcap-2.71-h39aace5_0
2025-05-07T20:26:14.1137130Z libcublas conda-forge/linux-64::libcublas-12.6.4.1-h5888daf_1
2025-05-07T20:26:14.1137668Z libcublas-dev conda-forge/linux-64::libcublas-dev-12.6.4.1-h5888daf_1
2025-05-07T20:26:14.1138200Z libcufft conda-forge/linux-64::libcufft-11.3.0.4-hbd13f7d_0
2025-05-07T20:26:14.1138694Z libcufft-dev conda-forge/linux-64::libcufft-dev-11.3.0.4-h5888daf_0
2025-05-07T20:26:14.1139196Z libcufile conda-forge/linux-64::libcufile-1.11.1.6-h12f29b5_4
2025-05-07T20:26:14.1139706Z libcufile-dev conda-forge/linux-64::libcufile-dev-1.11.1.6-h5888daf_4
2025-05-07T20:26:14.1140223Z libcurand conda-forge/linux-64::libcurand-10.3.7.77-hbd13f7d_0
2025-05-07T20:26:14.1140747Z libcurand-dev conda-forge/linux-64::libcurand-dev-10.3.7.77-h5888daf_0
2025-05-07T20:26:14.1141274Z libcusolver conda-forge/linux-64::libcusolver-11.7.1.2-h5888daf_1
2025-05-07T20:26:14.1141823Z libcusolver-dev conda-forge/linux-64::libcusolver-dev-11.7.1.2-h5888daf_1
2025-05-07T20:26:14.1142375Z libcusparse conda-forge/linux-64::libcusparse-12.5.4.2-hbd13f7d_0
2025-05-07T20:26:14.1142920Z libcusparse-dev conda-forge/linux-64::libcusparse-dev-12.5.4.2-h5888daf_0
2025-05-07T20:26:14.1143443Z libedit conda-forge/linux-64::libedit-3.1.20191231-he28a2e2_2
2025-05-07T20:26:14.1143942Z libfreetype conda-forge/linux-64::libfreetype-2.13.3-ha770c72_1
2025-05-07T20:26:14.1144457Z libfreetype6 conda-forge/linux-64::libfreetype6-2.13.3-h48d6fc4_1
2025-05-07T20:26:14.1144983Z libgcrypt-lib conda-forge/linux-64::libgcrypt-lib-1.11.0-hb9d3cd8_2
2025-05-07T20:26:14.1145470Z libglib conda-forge/linux-64::libglib-2.84.0-h2ff4ddf_0
2025-05-07T20:26:14.1146122Z libgpg-error conda-forge/linux-64::libgpg-error-1.55-h3f2d84a_0
2025-05-07T20:26:14.1146783Z libiconv conda-forge/linux-64::libiconv-1.18-h4ce23a2_1
2025-05-07T20:26:14.1147343Z libnl conda-forge/linux-64::libnl-3.11.0-hb9d3cd8_0
2025-05-07T20:26:14.1147775Z libnpp conda-forge/linux-64::libnpp-12.3.1.54-h5888daf_0
2025-05-07T20:26:14.1148250Z libnpp-dev conda-forge/linux-64::libnpp-dev-12.3.1.54-h5888daf_0
2025-05-07T20:26:14.1148733Z libnuma conda-forge/linux-64::libnuma-2.0.18-h4ab18f5_2
2025-05-07T20:26:14.1149211Z libnvfatbin conda-forge/linux-64::libnvfatbin-12.6.77-hbd13f7d_0
2025-05-07T20:26:14.1149755Z libnvfatbin-dev conda-forge/linux-64::libnvfatbin-dev-12.6.77-h5888daf_0
2025-05-07T20:26:14.1150306Z libnvjitlink conda-forge/linux-64::libnvjitlink-12.6.85-hbd13f7d_0
2025-05-07T20:26:14.1150865Z libnvjitlink-dev conda-forge/linux-64::libnvjitlink-dev-12.6.85-h5888daf_0
2025-05-07T20:26:14.1151533Z libnvjpeg conda-forge/linux-64::libnvjpeg-12.3.3.54-h5888daf_0
2025-05-07T20:26:14.1152068Z libnvjpeg-dev conda-forge/linux-64::libnvjpeg-dev-12.3.3.54-ha770c72_0
2025-05-07T20:26:14.1152570Z libpng conda-forge/linux-64::libpng-1.6.47-h943b412_0
2025-05-07T20:26:14.1153037Z libsystemd0 conda-forge/linux-64::libsystemd0-256.9-h2774228_0
2025-05-07T20:26:14.1153513Z libudev1 conda-forge/linux-64::libudev1-257.4-h9a4d06a_0
2025-05-07T20:26:14.1153961Z libxcb conda-forge/linux-64::libxcb-1.17.0-h8a09558_0
2025-05-07T20:26:14.1154557Z libxkbcommon conda-forge/linux-64::libxkbcommon-1.8.0-hc4a0caf_0
2025-05-07T20:26:14.1155067Z libxkbfile conda-forge/linux-64::libxkbfile-1.1.0-h166bdaf_1
2025-05-07T20:26:14.1155528Z libxml2 conda-forge/linux-64::libxml2-2.13.5-h064dc61_0
2025-05-07T20:26:14.1155963Z lz4-c conda-forge/linux-64::lz4-c-1.9.4-hcb278e6_0
2025-05-07T20:26:14.1156475Z nsight-compute conda-forge/linux-64::nsight-compute-2024.3.2.3-hb5ebaad_0
2025-05-07T20:26:14.1156978Z nspr conda-forge/linux-64::nspr-4.36-h5888daf_0
2025-05-07T20:26:14.1157368Z nss conda-forge/linux-64::nss-3.111-h159eef7_0
2025-05-07T20:26:14.1157832Z ocl-icd conda-forge/linux-64::ocl-icd-2.3.3-hb9d3cd8_0
2025-05-07T20:26:14.1158347Z opencl-headers conda-forge/linux-64::opencl-headers-2024.10.24-h5888daf_0
2025-05-07T20:26:14.1158857Z pcre2 conda-forge/linux-64::pcre2-10.44-hc749103_2
2025-05-07T20:26:14.1159343Z pthread-stubs conda-forge/linux-64::pthread-stubs-0.4-hb9d3cd8_1002
2025-05-07T20:26:14.1159863Z rdma-core conda-forge/linux-64::rdma-core-55.0-h5888daf_0
2025-05-07T20:26:14.1160320Z wayland conda-forge/linux-64::wayland-1.23.1-h3e06ad9_0
2025-05-07T20:26:14.1160765Z xcb-util conda-forge/linux-64::xcb-util-0.4.1-hb711507_2
2025-05-07T20:26:14.1161292Z xcb-util-cursor conda-forge/linux-64::xcb-util-cursor-0.1.5-hb9d3cd8_0
2025-05-07T20:26:14.1161849Z xcb-util-image conda-forge/linux-64::xcb-util-image-0.4.0-hb711507_2
2025-05-07T20:26:14.1162412Z xcb-util-keysyms conda-forge/linux-64::xcb-util-keysyms-0.4.1-hb711507_0
2025-05-07T20:26:14.1163009Z xcb-util-renderut~ conda-forge/linux-64::xcb-util-renderutil-0.3.10-hb711507_0
2025-05-07T20:26:14.1163573Z xcb-util-wm conda-forge/linux-64::xcb-util-wm-0.4.2-hb711507_0
2025-05-07T20:26:14.1164109Z xkeyboard-config conda-forge/linux-64::xkeyboard-config-2.44-hb9d3cd8_0
2025-05-07T20:26:14.1164729Z xorg-libice conda-forge/linux-64::xorg-libice-1.1.2-hb9d3cd8_0
2025-05-07T20:26:14.1165395Z xorg-libsm conda-forge/linux-64::xorg-libsm-1.2.6-he73a12e_0
2025-05-07T20:26:14.1166004Z xorg-libx11 conda-forge/linux-64::xorg-libx11-1.8.12-h4f16b4b_0
2025-05-07T20:26:14.1166502Z xorg-libxau conda-forge/linux-64::xorg-libxau-1.0.12-hb9d3cd8_0
2025-05-07T20:26:14.1167074Z xorg-libxcomposite conda-forge/linux-64::xorg-libxcomposite-0.4.6-hb9d3cd8_2
2025-05-07T20:26:14.1167682Z xorg-libxdamage conda-forge/linux-64::xorg-libxdamage-1.1.6-hb9d3cd8_0
2025-05-07T20:26:14.1168265Z xorg-libxdmcp conda-forge/linux-64::xorg-libxdmcp-1.1.5-hb9d3cd8_0
2025-05-07T20:26:14.1168790Z xorg-libxext conda-forge/linux-64::xorg-libxext-1.3.6-hb9d3cd8_0
2025-05-07T20:26:14.1169311Z xorg-libxfixes conda-forge/linux-64::xorg-libxfixes-6.0.1-hb9d3cd8_0
2025-05-07T20:26:14.1169825Z xorg-libxi conda-forge/linux-64::xorg-libxi-1.8.2-hb9d3cd8_0
2025-05-07T20:26:14.1170440Z xorg-libxrandr conda-forge/linux-64::xorg-libxrandr-1.5.4-hb9d3cd8_0
2025-05-07T20:26:14.1171204Z xorg-libxrender conda-forge/linux-64::xorg-libxrender-0.9.12-hb9d3cd8_0
2025-05-07T20:26:14.1171780Z xorg-libxtst conda-forge/linux-64::xorg-libxtst-1.2.5-hb9d3cd8_3
2025-05-07T20:26:14.1172403Z zstd conda-forge/linux-64::zstd-1.5.7-hb8e6e7a_2
2025-05-07T20:26:14.1172755Z 
2025-05-07T20:26:14.1173073Z The following packages will be UPDATED:
2025-05-07T20:26:14.1173291Z 
2025-05-07T20:26:14.1173464Z libsqlite 3.46.0-hde9e2c9_0 --> 3.49.2-hee588c1_0
2025-05-07T20:26:14.1173881Z libzlib 1.2.13-h4ab18f5_6 --> 1.3.1-hb9d3cd8_2
2025-05-07T20:26:14.1174281Z zlib 1.2.13-h4ab18f5_6 --> 1.3.1-hb9d3cd8_2
2025-05-07T20:26:14.1174533Z 
2025-05-07T20:26:14.1174843Z The following packages will be SUPERSEDED by a higher-priority channel:
2025-05-07T20:26:14.1175263Z 
2025-05-07T20:26:14.1175536Z sqlite pkgs/main::sqlite-3.45.3-h5eee18b_0 --> conda-forge::sqlite-3.32.3-hcee41ef_1
2025-05-07T20:26:14.1176134Z tk pkgs/main::tk-8.6.14-h39e8969_0 --> conda-forge::tk-8.6.13-noxft_h4845f30_101
2025-05-07T20:26:14.1176474Z 
2025-05-07T20:26:14.1176656Z Downloading and Extracting Packages: ...working...
2025-05-07T20:26:14.2121606Z ... (more hidden) ...
2025-05-07T20:26:19.6815075Z cuda-nsight-12.6.77 | 113.2 MB | ########## | 100%
2025-05-07T20:26:19.6945032Z libcusparse-12.5.4.2 | 118.6 MB | ########## | 100%
2025-05-07T20:26:21.9576305Z libcufft-11.3.0.4 | 156.2 MB | ########## | 100%
| 443.1 MB | ######5 | 66% 2025-05-07T20:26:23.0274570Z 2025-05-07T20:26:23.0274576Z 2025-05-07T20:26:23.0274581Z 2025-05-07T20:26:23.0274586Z 2025-05-07T20:26:23.0277875Z 2025-05-07T20:26:23.0395115Z cuda-nvvp-12.6.80 | 109.3 MB | ########7 | 87%  2025-05-07T20:26:23.0395674Z 2025-05-07T20:26:23.0395678Z 2025-05-07T20:26:23.0395681Z 2025-05-07T20:26:23.0395685Z 2025-05-07T20:26:23.0395688Z 2025-05-07T20:26:23.0395692Z 2025-05-07T20:26:23.0397157Z 2025-05-07T20:26:23.0802865Z libnpp-12.3.1.54 | 93.4 MB | ##8 | 29%  2025-05-07T20:26:23.0803257Z 2025-05-07T20:26:23.0803263Z 2025-05-07T20:26:23.0803285Z 2025-05-07T20:26:23.0803291Z 2025-05-07T20:26:23.0803296Z 2025-05-07T20:26:23.0806711Z 2025-05-07T20:26:23.1083868Z libcusolver-11.7.1.2 | 95.8 MB | #########5 | 96%  2025-05-07T20:26:23.1431242Z nsight-compute-2024. | 443.1 MB | ######6 | 66% 2025-05-07T20:26:23.1431624Z 2025-05-07T20:26:23.1431630Z 2025-05-07T20:26:23.1431636Z 2025-05-07T20:26:23.1431642Z 2025-05-07T20:26:23.1431648Z 2025-05-07T20:26:23.1431655Z 2025-05-07T20:26:23.1434304Z 2025-05-07T20:26:23.1459928Z libnpp-12.3.1.54 | 93.4 MB | ###1 | 32%  2025-05-07T20:26:23.1460405Z 2025-05-07T20:26:23.1460410Z 2025-05-07T20:26:23.1460415Z 2025-05-07T20:26:23.1460420Z 2025-05-07T20:26:23.1460425Z 2025-05-07T20:26:23.1807459Z cuda-nvvp-12.6.80 | 109.3 MB | ########9 | 90%  2025-05-07T20:26:23.1807864Z 2025-05-07T20:26:23.1807869Z 2025-05-07T20:26:23.1807874Z 2025-05-07T20:26:23.1807879Z 2025-05-07T20:26:23.1807884Z 2025-05-07T20:26:23.1809995Z 2025-05-07T20:26:23.2121671Z libcusolver-11.7.1.2 | 95.8 MB | #########8 | 99%  2025-05-07T20:26:23.2433347Z nsight-compute-2024. | 443.1 MB | ######6 | 67% 2025-05-07T20:26:23.2433731Z 2025-05-07T20:26:23.2433737Z 2025-05-07T20:26:23.2433742Z 2025-05-07T20:26:23.2433747Z 2025-05-07T20:26:23.2433752Z 2025-05-07T20:26:23.2433758Z 2025-05-07T20:26:23.2437176Z 2025-05-07T20:26:23.2505226Z libnpp-12.3.1.54 | 93.4 MB | ###4 | 35%  2025-05-07T20:26:23.2505524Z 2025-05-07T20:26:23.2505538Z 2025-05-07T20:26:23.2505556Z 2025-05-07T20:26:23.2505560Z 2025-05-07T20:26:23.2505563Z 2025-05-07T20:26:23.3129911Z cuda-nvvp-12.6.80 | 109.3 MB | #########1 | 92%  2025-05-07T20:26:23.3434048Z nsight-compute-2024. | 443.1 MB | ######7 | 68% 2025-05-07T20:26:23.3434360Z 2025-05-07T20:26:23.3434365Z 2025-05-07T20:26:23.3434369Z 2025-05-07T20:26:23.3434372Z 2025-05-07T20:26:23.3434376Z 2025-05-07T20:26:23.3434379Z 2025-05-07T20:26:23.3434402Z 2025-05-07T20:26:23.3509056Z libnpp-12.3.1.54 | 93.4 MB | ###7 | 38%  2025-05-07T20:26:23.3509362Z 2025-05-07T20:26:23.3509368Z 2025-05-07T20:26:23.3509374Z 2025-05-07T20:26:23.3509379Z 2025-05-07T20:26:23.3511336Z 2025-05-07T20:26:23.4130485Z cuda-nvvp-12.6.80 | 109.3 MB | #########4 | 94%  2025-05-07T20:26:23.4437157Z nsight-compute-2024. | 443.1 MB | ######8 | 68% 2025-05-07T20:26:23.4437462Z 2025-05-07T20:26:23.4437466Z 2025-05-07T20:26:23.4437470Z 2025-05-07T20:26:23.4437473Z 2025-05-07T20:26:23.4437496Z 2025-05-07T20:26:23.4437500Z 2025-05-07T20:26:23.4440878Z 2025-05-07T20:26:23.4511830Z libnpp-12.3.1.54 | 93.4 MB | ####1 | 41%  2025-05-07T20:26:23.4512216Z 2025-05-07T20:26:23.4512221Z 2025-05-07T20:26:23.4512224Z 2025-05-07T20:26:23.4512228Z 2025-05-07T20:26:23.4512231Z 2025-05-07T20:26:23.5137595Z cuda-nvvp-12.6.80 | 109.3 MB | #########6 | 97%  2025-05-07T20:26:23.5401454Z nsight-compute-2024. 
| 443.1 MB | ######8 | 69% 2025-05-07T20:26:23.5401836Z 2025-05-07T20:26:23.5401843Z 2025-05-07T20:26:23.5401848Z 2025-05-07T20:26:23.5404918Z 2025-05-07T20:26:23.5516083Z cuda-nsight-12.6.77 | 113.2 MB | ########## | 100%  2025-05-07T20:26:23.5516445Z 2025-05-07T20:26:23.5516451Z 2025-05-07T20:26:23.5516456Z 2025-05-07T20:26:23.5516462Z 2025-05-07T20:26:23.5520715Z 2025-05-07T20:26:23.5531702Z cuda-nvvp-12.6.80 | 109.3 MB | #########9 | 99%  2025-05-07T20:26:23.5532003Z 2025-05-07T20:26:23.5532284Z 2025-05-07T20:26:23.5532288Z 2025-05-07T20:26:23.5532291Z 2025-05-07T20:26:23.5532295Z 2025-05-07T20:26:23.5532298Z 2025-05-07T20:26:23.5532302Z 2025-05-07T20:26:23.6141404Z libnpp-12.3.1.54 | 93.4 MB | ####4 | 44%  2025-05-07T20:26:23.6538024Z nsight-compute-2024. | 443.1 MB | ######9 | 69% 2025-05-07T20:26:23.6538304Z 2025-05-07T20:26:23.6538308Z 2025-05-07T20:26:23.6538313Z 2025-05-07T20:26:23.6538333Z 2025-05-07T20:26:23.6538336Z 2025-05-07T20:26:23.6538340Z 2025-05-07T20:26:23.6540167Z 2025-05-07T20:26:23.7143164Z libnpp-12.3.1.54 | 93.4 MB | ####8 | 49%  2025-05-07T20:26:23.7542427Z nsight-compute-2024. | 443.1 MB | ####### | 70% 2025-05-07T20:26:23.7542793Z 2025-05-07T20:26:23.7542800Z 2025-05-07T20:26:23.7542805Z 2025-05-07T20:26:23.7542810Z 2025-05-07T20:26:23.7542815Z 2025-05-07T20:26:23.7542829Z 2025-05-07T20:26:23.7545723Z 2025-05-07T20:26:23.8144097Z libnpp-12.3.1.54 | 93.4 MB | #####3 | 53%  2025-05-07T20:26:23.8543621Z nsight-compute-2024. | 443.1 MB | #######1 | 71% 2025-05-07T20:26:23.8543892Z 2025-05-07T20:26:23.8543896Z 2025-05-07T20:26:23.8543899Z 2025-05-07T20:26:23.8543903Z 2025-05-07T20:26:23.8543906Z 2025-05-07T20:26:23.8543910Z 2025-05-07T20:26:23.8547635Z 2025-05-07T20:26:23.9146500Z libnpp-12.3.1.54 | 93.4 MB | #####7 | 58%  2025-05-07T20:26:23.9545958Z nsight-compute-2024. | 443.1 MB | #######2 | 72% 2025-05-07T20:26:23.9546230Z 2025-05-07T20:26:23.9546234Z 2025-05-07T20:26:23.9546237Z 2025-05-07T20:26:23.9546241Z 2025-05-07T20:26:23.9546244Z 2025-05-07T20:26:23.9546248Z 2025-05-07T20:26:23.9547875Z 2025-05-07T20:26:24.0147443Z libnpp-12.3.1.54 | 93.4 MB | ######1 | 62%  2025-05-07T20:26:24.0608692Z nsight-compute-2024. | 443.1 MB | #######2 | 73% 2025-05-07T20:26:24.0608983Z 2025-05-07T20:26:24.0608988Z 2025-05-07T20:26:24.0608993Z 2025-05-07T20:26:24.0609012Z 2025-05-07T20:26:24.0609017Z 2025-05-07T20:26:24.0609022Z 2025-05-07T20:26:24.0609214Z 2025-05-07T20:26:24.1156986Z libnpp-12.3.1.54 | 93.4 MB | ######5 | 65%  2025-05-07T20:26:24.1757660Z nsight-compute-2024. | 443.1 MB | #######3 | 74% 2025-05-07T20:26:24.1757985Z 2025-05-07T20:26:24.1757989Z 2025-05-07T20:26:24.1757993Z 2025-05-07T20:26:24.1757996Z 2025-05-07T20:26:24.1758000Z 2025-05-07T20:26:24.1758018Z 2025-05-07T20:26:24.1758022Z 2025-05-07T20:26:24.2168178Z libnpp-12.3.1.54 | 93.4 MB | ######9 | 69%  2025-05-07T20:26:24.3146634Z nsight-compute-2024. | 443.1 MB | #######4 | 74% 2025-05-07T20:26:24.3146931Z 2025-05-07T20:26:24.3146937Z 2025-05-07T20:26:24.3146942Z 2025-05-07T20:26:24.3146951Z 2025-05-07T20:26:24.3146956Z 2025-05-07T20:26:24.3146963Z 2025-05-07T20:26:24.3155224Z 2025-05-07T20:26:24.3172313Z libnpp-12.3.1.54 | 93.4 MB | #######2 | 73%  2025-05-07T20:26:24.4149875Z nsight-compute-2024. 
| 443.1 MB | #######5 | 75% 2025-05-07T20:26:24.4150152Z 2025-05-07T20:26:24.4150157Z 2025-05-07T20:26:24.4150161Z 2025-05-07T20:26:24.4150165Z 2025-05-07T20:26:24.4150168Z 2025-05-07T20:26:24.4150172Z 2025-05-07T20:26:24.4152646Z 2025-05-07T20:26:24.4251484Z libnpp-12.3.1.54 | 93.4 MB | #######6 | 77%  2025-05-07T20:26:24.5150409Z nsight-compute-2024. | 443.1 MB | #######6 | 76% 2025-05-07T20:26:24.5150705Z 2025-05-07T20:26:24.5150709Z 2025-05-07T20:26:24.5150713Z 2025-05-07T20:26:24.5150716Z 2025-05-07T20:26:24.5150720Z 2025-05-07T20:26:24.5150724Z 2025-05-07T20:26:24.5150727Z 2025-05-07T20:26:24.5281968Z libnpp-12.3.1.54 | 93.4 MB | ######## | 80%  2025-05-07T20:26:24.6191397Z nsight-compute-2024. | 443.1 MB | #######6 | 77% 2025-05-07T20:26:24.6191753Z 2025-05-07T20:26:24.6191759Z 2025-05-07T20:26:24.6191764Z 2025-05-07T20:26:24.6191770Z 2025-05-07T20:26:24.6191775Z 2025-05-07T20:26:24.6192092Z 2025-05-07T20:26:24.6192099Z 2025-05-07T20:26:24.6282051Z libnpp-12.3.1.54 | 93.4 MB | ########3 | 84%  2025-05-07T20:26:24.7285256Z nsight-compute-2024. | 443.1 MB | #######7 | 78% 2025-05-07T20:26:24.7942852Z nsight-compute-2024. | 443.1 MB | #######8 | 79% 2025-05-07T20:26:24.7943246Z 2025-05-07T20:26:24.7943251Z 2025-05-07T20:26:24.7943256Z 2025-05-07T20:26:24.7943261Z 2025-05-07T20:26:24.7943288Z 2025-05-07T20:26:24.7943293Z 2025-05-07T20:26:24.7946698Z 2025-05-07T20:26:24.8284615Z libnpp-12.3.1.54 | 93.4 MB | ########7 | 88%  2025-05-07T20:26:24.8953464Z nsight-compute-2024. | 443.1 MB | #######9 | 79% 2025-05-07T20:26:24.8953745Z 2025-05-07T20:26:24.8953749Z 2025-05-07T20:26:24.8953752Z 2025-05-07T20:26:24.8953756Z 2025-05-07T20:26:24.8953759Z 2025-05-07T20:26:24.8953764Z 2025-05-07T20:26:24.8953768Z 2025-05-07T20:26:24.9400786Z libnpp-12.3.1.54 | 93.4 MB | #########1 | 91%  2025-05-07T20:26:24.9953983Z nsight-compute-2024. | 443.1 MB | ######## | 80% 2025-05-07T20:26:24.9954255Z 2025-05-07T20:26:24.9954493Z 2025-05-07T20:26:24.9954502Z 2025-05-07T20:26:24.9954509Z 2025-05-07T20:26:24.9954515Z 2025-05-07T20:26:24.9954520Z 2025-05-07T20:26:24.9958650Z 2025-05-07T20:26:25.0404103Z libnpp-12.3.1.54 | 93.4 MB | #########4 | 95%  2025-05-07T20:26:25.1404762Z nsight-compute-2024. | 443.1 MB | ########1 | 81% 2025-05-07T20:26:25.1439352Z nsight-compute-2024. | 443.1 MB | ########2 | 82% 2025-05-07T20:26:25.1439615Z 2025-05-07T20:26:25.1439897Z 2025-05-07T20:26:25.1439901Z 2025-05-07T20:26:25.1439905Z 2025-05-07T20:26:25.1440072Z 2025-05-07T20:26:25.1440078Z 2025-05-07T20:26:25.1440496Z 2025-05-07T20:26:25.2417264Z libnpp-12.3.1.54 | 93.4 MB | #########8 | 98%  2025-05-07T20:26:25.3419966Z nsight-compute-2024. | 443.1 MB | ########2 | 83% 2025-05-07T20:26:25.4420692Z nsight-compute-2024. | 443.1 MB | ########3 | 84% 2025-05-07T20:26:25.5422574Z nsight-compute-2024. | 443.1 MB | ########4 | 85% 2025-05-07T20:26:25.6426162Z nsight-compute-2024. | 443.1 MB | ########5 | 86% 2025-05-07T20:26:25.7429833Z nsight-compute-2024. | 443.1 MB | ########6 | 87% 2025-05-07T20:26:25.8437424Z nsight-compute-2024. | 443.1 MB | ########7 | 88% 2025-05-07T20:26:25.9461465Z nsight-compute-2024. | 443.1 MB | ########8 | 89% 2025-05-07T20:26:25.9762646Z nsight-compute-2024. 
| 443.1 MB | ########9 | 90% 2025-05-07T20:26:25.9763023Z 2025-05-07T20:26:25.9763029Z 2025-05-07T20:26:25.9763034Z 2025-05-07T20:26:25.9763048Z 2025-05-07T20:26:25.9763053Z 2025-05-07T20:26:25.9763058Z 2025-05-07T20:26:26.0182836Z libcusolver-11.7.1.2 | 95.8 MB | ########## | 100%  2025-05-07T20:26:26.0183238Z 2025-05-07T20:26:26.0183244Z 2025-05-07T20:26:26.0183258Z 2025-05-07T20:26:26.0183264Z 2025-05-07T20:26:26.0183269Z 2025-05-07T20:26:26.0183274Z 2025-05-07T20:26:26.0183278Z 2025-05-07T20:26:26.0183308Z 2025-05-07T20:26:26.0522427Z cuda-nvdisasm-12.6.7 | 47.6 MB | | 0%  2025-05-07T20:26:26.1186595Z nsight-compute-2024. | 443.1 MB | ######### | 91% 2025-05-07T20:26:26.1186955Z 2025-05-07T20:26:26.1186961Z 2025-05-07T20:26:26.1186966Z 2025-05-07T20:26:26.1186971Z 2025-05-07T20:26:26.1186984Z 2025-05-07T20:26:26.1186990Z 2025-05-07T20:26:26.1186995Z 2025-05-07T20:26:26.1187000Z 2025-05-07T20:26:26.1766784Z cuda-nvdisasm-12.6.7 | 47.6 MB | 6 | 7%  2025-05-07T20:26:26.2241225Z nsight-compute-2024. | 443.1 MB | #########1 | 92% 2025-05-07T20:26:26.2241604Z 2025-05-07T20:26:26.2241610Z 2025-05-07T20:26:26.2241615Z 2025-05-07T20:26:26.2241621Z 2025-05-07T20:26:26.2241626Z 2025-05-07T20:26:26.2241631Z 2025-05-07T20:26:26.2241652Z 2025-05-07T20:26:26.2246620Z 2025-05-07T20:26:26.2966001Z cuda-nvdisasm-12.6.7 | 47.6 MB | #3 | 14%  2025-05-07T20:26:26.3318821Z nsight-compute-2024. | 443.1 MB | #########2 | 92% 2025-05-07T20:26:26.3319501Z 2025-05-07T20:26:26.3319507Z 2025-05-07T20:26:26.3319512Z 2025-05-07T20:26:26.3319517Z 2025-05-07T20:26:26.3319522Z 2025-05-07T20:26:26.3319527Z 2025-05-07T20:26:26.3319532Z 2025-05-07T20:26:26.3321631Z 2025-05-07T20:26:26.4247632Z cuda-nvdisasm-12.6.7 | 47.6 MB | ## | 21%  2025-05-07T20:26:26.4320279Z nsight-compute-2024. | 443.1 MB | #########3 | 93% 2025-05-07T20:26:26.4320650Z 2025-05-07T20:26:26.4320790Z 2025-05-07T20:26:26.4320796Z 2025-05-07T20:26:26.4320805Z 2025-05-07T20:26:26.4320810Z 2025-05-07T20:26:26.4320815Z 2025-05-07T20:26:26.4320820Z 2025-05-07T20:26:26.4324869Z 2025-05-07T20:26:26.5321344Z cuda-nvdisasm-12.6.7 | 47.6 MB | ##7 | 27%  2025-05-07T20:26:26.5321781Z 2025-05-07T20:26:26.5321794Z 2025-05-07T20:26:26.5321798Z 2025-05-07T20:26:26.5321802Z 2025-05-07T20:26:26.5321805Z 2025-05-07T20:26:26.5321809Z 2025-05-07T20:26:26.5321814Z 2025-05-07T20:26:26.5321842Z 2025-05-07T20:26:26.5381950Z cuda-nvdisasm-12.6.7 | 47.6 MB | ###3 | 34%  2025-05-07T20:26:26.6450896Z nsight-compute-2024. | 443.1 MB | #########4 | 94% 2025-05-07T20:26:26.6451327Z 2025-05-07T20:26:26.6451333Z 2025-05-07T20:26:26.6451338Z 2025-05-07T20:26:26.6451343Z 2025-05-07T20:26:26.6451360Z 2025-05-07T20:26:26.6451366Z 2025-05-07T20:26:26.6451371Z 2025-05-07T20:26:26.6451416Z 2025-05-07T20:26:26.6605450Z cuda-nvdisasm-12.6.7 | 47.6 MB | #### | 41%  2025-05-07T20:26:26.7477165Z nsight-compute-2024. | 443.1 MB | #########4 | 95% 2025-05-07T20:26:26.7477530Z 2025-05-07T20:26:26.7477536Z 2025-05-07T20:26:26.7477541Z 2025-05-07T20:26:26.7477546Z 2025-05-07T20:26:26.7477551Z 2025-05-07T20:26:26.7477557Z 2025-05-07T20:26:26.7477562Z 2025-05-07T20:26:26.7479316Z 2025-05-07T20:26:26.7695193Z cuda-nvdisasm-12.6.7 | 47.6 MB | ####7 | 47%  2025-05-07T20:26:26.8483159Z nsight-compute-2024. 
| 443.1 MB | #########5 | 96% 2025-05-07T20:26:26.8483571Z 2025-05-07T20:26:26.8483577Z 2025-05-07T20:26:26.8483582Z 2025-05-07T20:26:26.8483587Z 2025-05-07T20:26:26.8483592Z 2025-05-07T20:26:26.8483597Z 2025-05-07T20:26:26.8483602Z 2025-05-07T20:26:26.8486118Z 2025-05-07T20:26:26.8744717Z cuda-nvdisasm-12.6.7 | 47.6 MB | #####3 | 54%  2025-05-07T20:26:26.8895498Z nsight-compute-2024. | 443.1 MB | #########6 | 96% 2025-05-07T20:26:26.8895852Z 2025-05-07T20:26:26.8895867Z 2025-05-07T20:26:26.8895873Z 2025-05-07T20:26:26.8895878Z 2025-05-07T20:26:26.8898689Z 2025-05-07T20:26:26.9240655Z cuda-nvvp-12.6.80 | 109.3 MB | ########## | 100%  2025-05-07T20:26:26.9241086Z 2025-05-07T20:26:26.9241093Z 2025-05-07T20:26:26.9241099Z 2025-05-07T20:26:26.9241105Z 2025-05-07T20:26:26.9241113Z 2025-05-07T20:26:26.9241119Z 2025-05-07T20:26:26.9241126Z 2025-05-07T20:26:26.9241132Z 2025-05-07T20:26:26.9241137Z 2025-05-07T20:26:26.9484180Z libcurand-10.3.7.77 | 39.9 MB | | 0%  2025-05-07T20:26:26.9484608Z 2025-05-07T20:26:26.9484613Z 2025-05-07T20:26:26.9484618Z 2025-05-07T20:26:26.9484629Z 2025-05-07T20:26:26.9484634Z 2025-05-07T20:26:26.9484639Z 2025-05-07T20:26:26.9484644Z 2025-05-07T20:26:26.9487123Z 2025-05-07T20:26:26.9799963Z cuda-nvdisasm-12.6.7 | 47.6 MB | ###### | 60%  2025-05-07T20:26:27.0240585Z nsight-compute-2024. | 443.1 MB | #########6 | 97% 2025-05-07T20:26:27.0240961Z 2025-05-07T20:26:27.0240967Z 2025-05-07T20:26:27.0240972Z 2025-05-07T20:26:27.0240978Z 2025-05-07T20:26:27.0240985Z 2025-05-07T20:26:27.0240991Z 2025-05-07T20:26:27.0240997Z 2025-05-07T20:26:27.0241004Z 2025-05-07T20:26:27.0243872Z 2025-05-07T20:26:27.0602343Z libcurand-10.3.7.77 | 39.9 MB | 7 | 7%  2025-05-07T20:26:27.0602758Z 2025-05-07T20:26:27.0602764Z 2025-05-07T20:26:27.0602769Z 2025-05-07T20:26:27.0602773Z 2025-05-07T20:26:27.0603086Z 2025-05-07T20:26:27.0603091Z 2025-05-07T20:26:27.0603110Z 2025-05-07T20:26:27.0603118Z 2025-05-07T20:26:27.0834049Z cuda-nvdisasm-12.6.7 | 47.6 MB | ######6 | 67%  2025-05-07T20:26:27.1241461Z nsight-compute-2024. | 443.1 MB | #########7 | 98% 2025-05-07T20:26:27.1241865Z 2025-05-07T20:26:27.1241871Z 2025-05-07T20:26:27.1241876Z 2025-05-07T20:26:27.1241881Z 2025-05-07T20:26:27.1241900Z 2025-05-07T20:26:27.1241906Z 2025-05-07T20:26:27.1241911Z 2025-05-07T20:26:27.1241916Z 2025-05-07T20:26:27.1241921Z 2025-05-07T20:26:27.1667944Z libcurand-10.3.7.77 | 39.9 MB | #4 | 14%  2025-05-07T20:26:27.1668370Z 2025-05-07T20:26:27.1668376Z 2025-05-07T20:26:27.1668381Z 2025-05-07T20:26:27.1668388Z 2025-05-07T20:26:27.1668393Z 2025-05-07T20:26:27.1668398Z 2025-05-07T20:26:27.1668404Z 2025-05-07T20:26:27.1668409Z 2025-05-07T20:26:27.2242312Z cuda-nvdisasm-12.6.7 | 47.6 MB | #######2 | 73%  2025-05-07T20:26:27.2242786Z 2025-05-07T20:26:27.2242790Z 2025-05-07T20:26:27.2242794Z 2025-05-07T20:26:27.2242797Z 2025-05-07T20:26:27.2242800Z 2025-05-07T20:26:27.2242804Z 2025-05-07T20:26:27.2242807Z 2025-05-07T20:26:27.2242822Z 2025-05-07T20:26:27.2242825Z 2025-05-07T20:26:27.2672438Z libcurand-10.3.7.77 | 39.9 MB | ##2 | 23%  2025-05-07T20:26:27.2672858Z 2025-05-07T20:26:27.2672886Z 2025-05-07T20:26:27.2672892Z 2025-05-07T20:26:27.2672897Z 2025-05-07T20:26:27.2672902Z 2025-05-07T20:26:27.2672924Z 2025-05-07T20:26:27.2672929Z 2025-05-07T20:26:27.2672934Z 2025-05-07T20:26:27.2823272Z cuda-nvdisasm-12.6.7 | 47.6 MB | #######9 | 80%  2025-05-07T20:26:27.3303799Z nsight-compute-2024. 
| 443.1 MB | #########8 | 98% 2025-05-07T20:26:27.3304088Z 2025-05-07T20:26:27.3304096Z 2025-05-07T20:26:27.3304103Z 2025-05-07T20:26:27.3304108Z 2025-05-07T20:26:27.3304113Z 2025-05-07T20:26:27.3304119Z 2025-05-07T20:26:27.3304125Z 2025-05-07T20:26:27.3304161Z 2025-05-07T20:26:27.3306242Z 2025-05-07T20:26:27.3712145Z libcurand-10.3.7.77 | 39.9 MB | ### | 30%  2025-05-07T20:26:27.3712473Z 2025-05-07T20:26:27.3712478Z 2025-05-07T20:26:27.3712483Z 2025-05-07T20:26:27.3712505Z 2025-05-07T20:26:27.3712511Z 2025-05-07T20:26:27.3712516Z 2025-05-07T20:26:27.3712522Z 2025-05-07T20:26:27.3714854Z 2025-05-07T20:26:27.3895888Z cuda-nvdisasm-12.6.7 | 47.6 MB | ########6 | 86%  2025-05-07T20:26:27.4464583Z nsight-compute-2024. | 443.1 MB | #########8 | 99% 2025-05-07T20:26:27.4464907Z 2025-05-07T20:26:27.4464911Z 2025-05-07T20:26:27.4464915Z 2025-05-07T20:26:27.4464920Z 2025-05-07T20:26:27.4464932Z 2025-05-07T20:26:27.4464936Z 2025-05-07T20:26:27.4464940Z 2025-05-07T20:26:27.4464943Z 2025-05-07T20:26:27.4464947Z 2025-05-07T20:26:27.4752922Z libcurand-10.3.7.77 | 39.9 MB | ###7 | 38%  2025-05-07T20:26:27.4753293Z 2025-05-07T20:26:27.4753339Z 2025-05-07T20:26:27.4753344Z 2025-05-07T20:26:27.4753349Z 2025-05-07T20:26:27.4753354Z 2025-05-07T20:26:27.4753360Z 2025-05-07T20:26:27.4753365Z 2025-05-07T20:26:27.4753372Z 2025-05-07T20:26:27.4898900Z cuda-nvdisasm-12.6.7 | 47.6 MB | #########2 | 93%  2025-05-07T20:26:27.5470363Z nsight-compute-2024. | 443.1 MB | #########9 | 99% 2025-05-07T20:26:27.5470646Z 2025-05-07T20:26:27.5470942Z 2025-05-07T20:26:27.5470965Z 2025-05-07T20:26:27.5470970Z 2025-05-07T20:26:27.5470975Z 2025-05-07T20:26:27.5470980Z 2025-05-07T20:26:27.5470985Z 2025-05-07T20:26:27.5470990Z 2025-05-07T20:26:27.5470995Z 2025-05-07T20:26:27.5764894Z libcurand-10.3.7.77 | 39.9 MB | ####4 | 45%  2025-05-07T20:26:27.5765324Z 2025-05-07T20:26:27.5765329Z 2025-05-07T20:26:27.5765332Z 2025-05-07T20:26:27.5765336Z 2025-05-07T20:26:27.5765339Z 2025-05-07T20:26:27.5765343Z 2025-05-07T20:26:27.5765346Z 2025-05-07T20:26:27.5765350Z 2025-05-07T20:26:27.5909847Z cuda-nvdisasm-12.6.7 | 47.6 MB | #########8 | 99%  2025-05-07T20:26:27.6472760Z nsight-compute-2024. 
| 443.1 MB | #########9 | 100% 2025-05-07T20:26:27.6473064Z 2025-05-07T20:26:27.6473068Z 2025-05-07T20:26:27.6473072Z 2025-05-07T20:26:27.6473076Z 2025-05-07T20:26:27.6473080Z 2025-05-07T20:26:27.6473083Z 2025-05-07T20:26:27.6473087Z 2025-05-07T20:26:27.6473090Z 2025-05-07T20:26:27.6473790Z 2025-05-07T20:26:27.7476179Z libcurand-10.3.7.77 | 39.9 MB | #####2 | 53%  2025-05-07T20:26:27.7476490Z 2025-05-07T20:26:27.7476493Z 2025-05-07T20:26:27.7476497Z 2025-05-07T20:26:27.7476500Z 2025-05-07T20:26:27.7476504Z 2025-05-07T20:26:27.7476507Z 2025-05-07T20:26:27.7476511Z 2025-05-07T20:26:27.7476515Z 2025-05-07T20:26:27.7476675Z 2025-05-07T20:26:27.8481106Z libcurand-10.3.7.77 | 39.9 MB | ###### | 61%  2025-05-07T20:26:27.8481429Z 2025-05-07T20:26:27.8481433Z 2025-05-07T20:26:27.8481436Z 2025-05-07T20:26:27.8481473Z 2025-05-07T20:26:27.8481476Z 2025-05-07T20:26:27.8481480Z 2025-05-07T20:26:27.8481483Z 2025-05-07T20:26:27.8481487Z 2025-05-07T20:26:27.8481945Z 2025-05-07T20:26:27.9482367Z libcurand-10.3.7.77 | 39.9 MB | ######9 | 70%  2025-05-07T20:26:27.9482706Z 2025-05-07T20:26:27.9482713Z 2025-05-07T20:26:27.9482717Z 2025-05-07T20:26:27.9482721Z 2025-05-07T20:26:27.9482733Z 2025-05-07T20:26:27.9482771Z 2025-05-07T20:26:27.9482775Z 2025-05-07T20:26:27.9482780Z 2025-05-07T20:26:27.9484213Z 2025-05-07T20:26:28.0491624Z libcurand-10.3.7.77 | 39.9 MB | #######8 | 79%  2025-05-07T20:26:28.0491955Z 2025-05-07T20:26:28.0491960Z 2025-05-07T20:26:28.0491963Z 2025-05-07T20:26:28.0491967Z 2025-05-07T20:26:28.0491971Z 2025-05-07T20:26:28.0491974Z 2025-05-07T20:26:28.0491977Z 2025-05-07T20:26:28.0491981Z 2025-05-07T20:26:28.0492649Z 2025-05-07T20:26:28.1496276Z libcurand-10.3.7.77 | 39.9 MB | ########7 | 87%  2025-05-07T20:26:28.1496729Z 2025-05-07T20:26:28.1496733Z 2025-05-07T20:26:28.1496745Z 2025-05-07T20:26:28.1496749Z 2025-05-07T20:26:28.1496752Z 2025-05-07T20:26:28.1496756Z 2025-05-07T20:26:28.1496759Z 2025-05-07T20:26:28.1496763Z 2025-05-07T20:26:28.1498757Z 2025-05-07T20:26:28.5061640Z libcurand-10.3.7.77 | 39.9 MB | #########7 | 97%  2025-05-07T20:26:28.5061988Z 2025-05-07T20:26:28.5062029Z 2025-05-07T20:26:28.5062033Z 2025-05-07T20:26:28.5062037Z 2025-05-07T20:26:28.5062040Z 2025-05-07T20:26:28.5062043Z 2025-05-07T20:26:28.5062047Z 2025-05-07T20:26:28.5346770Z libnpp-12.3.1.54 | 93.4 MB | ########## | 100%  2025-05-07T20:26:28.5347093Z 2025-05-07T20:26:28.5347099Z 2025-05-07T20:26:28.5347156Z 2025-05-07T20:26:28.5594488Z libcusparse-12.5.4.2 | 118.6 MB | ########## | 100%  2025-05-07T20:26:28.5594784Z 2025-05-07T20:26:28.5594787Z 2025-05-07T20:26:28.5594791Z 2025-05-07T20:26:28.5594794Z 2025-05-07T20:26:28.5594822Z 2025-05-07T20:26:28.5594827Z 2025-05-07T20:26:28.5594830Z 2025-05-07T20:26:28.5594834Z 2025-05-07T20:26:28.5594837Z 2025-05-07T20:26:28.5596229Z 2025-05-07T20:26:28.6595995Z gds-tools-1.11.1.6 | 37.8 MB | | 0%  2025-05-07T20:26:28.6596329Z 2025-05-07T20:26:28.6596333Z 2025-05-07T20:26:28.6596337Z 2025-05-07T20:26:28.6596341Z 2025-05-07T20:26:28.6596598Z 2025-05-07T20:26:28.6596602Z 2025-05-07T20:26:28.6596606Z 2025-05-07T20:26:28.6596609Z 2025-05-07T20:26:28.6596613Z 2025-05-07T20:26:28.6596616Z 2025-05-07T20:26:28.7598338Z gds-tools-1.11.1.6 | 37.8 MB | 8 | 8%  2025-05-07T20:26:28.7598675Z 2025-05-07T20:26:28.7598679Z 2025-05-07T20:26:28.7598683Z 2025-05-07T20:26:28.7598687Z 2025-05-07T20:26:28.7598691Z 2025-05-07T20:26:28.7598695Z 2025-05-07T20:26:28.7598700Z 2025-05-07T20:26:28.7598703Z 2025-05-07T20:26:28.7598707Z 2025-05-07T20:26:28.7599685Z 2025-05-07T20:26:28.8602904Z 
gds-tools-1.11.1.6 | 37.8 MB | #6 | 17%  2025-05-07T20:26:28.8603245Z 2025-05-07T20:26:28.8603249Z 2025-05-07T20:26:28.8603253Z 2025-05-07T20:26:28.8603256Z 2025-05-07T20:26:28.8603260Z 2025-05-07T20:26:28.8603263Z 2025-05-07T20:26:28.8603267Z 2025-05-07T20:26:28.8603271Z 2025-05-07T20:26:28.8603274Z 2025-05-07T20:26:28.8603551Z 2025-05-07T20:26:28.9607726Z gds-tools-1.11.1.6 | 37.8 MB | ##5 | 25%  2025-05-07T20:26:28.9608062Z 2025-05-07T20:26:28.9608066Z 2025-05-07T20:26:28.9608070Z 2025-05-07T20:26:28.9608075Z 2025-05-07T20:26:28.9608078Z 2025-05-07T20:26:28.9608082Z 2025-05-07T20:26:28.9608085Z 2025-05-07T20:26:28.9608089Z 2025-05-07T20:26:28.9608092Z 2025-05-07T20:26:28.9610130Z 2025-05-07T20:26:29.0609386Z gds-tools-1.11.1.6 | 37.8 MB | ###4 | 35%  2025-05-07T20:26:29.0609714Z 2025-05-07T20:26:29.0609718Z 2025-05-07T20:26:29.0609729Z 2025-05-07T20:26:29.0609765Z 2025-05-07T20:26:29.0609770Z 2025-05-07T20:26:29.0609775Z 2025-05-07T20:26:29.0609780Z 2025-05-07T20:26:29.0609787Z 2025-05-07T20:26:29.0609792Z 2025-05-07T20:26:29.0613352Z 2025-05-07T20:26:29.1612365Z gds-tools-1.11.1.6 | 37.8 MB | ####4 | 44%  2025-05-07T20:26:29.1612702Z 2025-05-07T20:26:29.1612706Z 2025-05-07T20:26:29.1612710Z 2025-05-07T20:26:29.1612713Z 2025-05-07T20:26:29.1612752Z 2025-05-07T20:26:29.1612755Z 2025-05-07T20:26:29.1612759Z 2025-05-07T20:26:29.1612763Z 2025-05-07T20:26:29.1612767Z 2025-05-07T20:26:29.1612770Z 2025-05-07T20:26:29.2027857Z gds-tools-1.11.1.6 | 37.8 MB | #####4 | 54%  2025-05-07T20:26:29.2028360Z 2025-05-07T20:26:29.2090354Z libcublas-12.6.4.1 | 256.2 MB | ########## | 100%  2025-05-07T20:26:29.2090630Z 2025-05-07T20:26:29.2090634Z 2025-05-07T20:26:29.2090638Z 2025-05-07T20:26:29.2090641Z 2025-05-07T20:26:29.2090645Z 2025-05-07T20:26:29.2090649Z 2025-05-07T20:26:29.2090682Z 2025-05-07T20:26:29.2092455Z 2025-05-07T20:26:29.2507589Z cuda-nvdisasm-12.6.7 | 47.6 MB | ########## | 100%  2025-05-07T20:26:29.2507905Z 2025-05-07T20:26:29.2507909Z 2025-05-07T20:26:29.2507922Z 2025-05-07T20:26:29.2507926Z 2025-05-07T20:26:29.2507929Z 2025-05-07T20:26:29.2507937Z 2025-05-07T20:26:29.2507940Z 2025-05-07T20:26:29.2507944Z 2025-05-07T20:26:29.2507948Z 2025-05-07T20:26:29.2507969Z 2025-05-07T20:26:29.2507973Z 2025-05-07T20:26:29.2508849Z 2025-05-07T20:26:29.2612437Z cuda-nvrtc-12.6.85 | 17.3 MB | | 0%  2025-05-07T20:26:29.2612837Z 2025-05-07T20:26:29.2612841Z 2025-05-07T20:26:29.2612845Z 2025-05-07T20:26:29.2612848Z 2025-05-07T20:26:29.2612852Z 2025-05-07T20:26:29.2612855Z 2025-05-07T20:26:29.2612859Z 2025-05-07T20:26:29.2612862Z 2025-05-07T20:26:29.2612866Z 2025-05-07T20:26:29.2612869Z 2025-05-07T20:26:29.2623029Z gds-tools-1.11.1.6 | 37.8 MB | ######4 | 64%  2025-05-07T20:26:29.2623348Z 2025-05-07T20:26:29.2623352Z 2025-05-07T20:26:29.2623356Z 2025-05-07T20:26:29.2623359Z 2025-05-07T20:26:29.2623363Z 2025-05-07T20:26:29.2623366Z 2025-05-07T20:26:29.2623370Z 2025-05-07T20:26:29.2623373Z 2025-05-07T20:26:29.2623377Z 2025-05-07T20:26:29.2623380Z 2025-05-07T20:26:29.2625973Z 2025-05-07T20:26:29.3512801Z cuda-nvcc-tools-12.6 | 23.0 MB | | 0%  2025-05-07T20:26:29.3513171Z 2025-05-07T20:26:29.3513175Z 2025-05-07T20:26:29.3513179Z 2025-05-07T20:26:29.3513184Z 2025-05-07T20:26:29.3513187Z 2025-05-07T20:26:29.3513191Z 2025-05-07T20:26:29.3513194Z 2025-05-07T20:26:29.3513198Z 2025-05-07T20:26:29.3513201Z 2025-05-07T20:26:29.3513205Z 2025-05-07T20:26:29.3513208Z 2025-05-07T20:26:29.3514678Z 2025-05-07T20:26:29.3620576Z cuda-nvrtc-12.6.85 | 17.3 MB | #6 | 16%  2025-05-07T20:26:29.3620957Z 2025-05-07T20:26:29.3620961Z 
2025-05-07T20:26:29.3621236Z 2025-05-07T20:26:29.3621240Z 2025-05-07T20:26:29.3621243Z 2025-05-07T20:26:29.3621247Z 2025-05-07T20:26:29.3621250Z 2025-05-07T20:26:29.3621254Z 2025-05-07T20:26:29.3621257Z 2025-05-07T20:26:29.3621261Z 2025-05-07T20:26:29.3626340Z 2025-05-07T20:26:29.3879947Z cuda-nvcc-tools-12.6 | 23.0 MB | #1 | 12%  2025-05-07T20:26:29.3880343Z 2025-05-07T20:26:29.3880347Z 2025-05-07T20:26:29.3880368Z 2025-05-07T20:26:29.3880372Z 2025-05-07T20:26:29.3880376Z 2025-05-07T20:26:29.3880390Z 2025-05-07T20:26:29.3880393Z 2025-05-07T20:26:29.3880397Z 2025-05-07T20:26:29.3880400Z 2025-05-07T20:26:29.3880404Z 2025-05-07T20:26:29.4515601Z gds-tools-1.11.1.6 | 37.8 MB | #######3 | 74%  2025-05-07T20:26:29.4515935Z 2025-05-07T20:26:29.4515939Z 2025-05-07T20:26:29.4515942Z 2025-05-07T20:26:29.4515946Z 2025-05-07T20:26:29.4515951Z 2025-05-07T20:26:29.4515956Z 2025-05-07T20:26:29.4515959Z 2025-05-07T20:26:29.4515993Z 2025-05-07T20:26:29.4515997Z 2025-05-07T20:26:29.4516001Z 2025-05-07T20:26:29.4516004Z 2025-05-07T20:26:29.4519054Z 2025-05-07T20:26:29.4634707Z cuda-nvrtc-12.6.85 | 17.3 MB | ###3 | 34%  2025-05-07T20:26:29.4635036Z 2025-05-07T20:26:29.4635040Z 2025-05-07T20:26:29.4635043Z 2025-05-07T20:26:29.4635047Z 2025-05-07T20:26:29.4635050Z 2025-05-07T20:26:29.4635054Z 2025-05-07T20:26:29.4635074Z 2025-05-07T20:26:29.4635077Z 2025-05-07T20:26:29.4635081Z 2025-05-07T20:26:29.4635084Z 2025-05-07T20:26:29.4635088Z 2025-05-07T20:26:29.5040584Z cuda-nvcc-tools-12.6 | 23.0 MB | ##3 | 23%  2025-05-07T20:26:29.5040920Z 2025-05-07T20:26:29.5040924Z 2025-05-07T20:26:29.5040927Z 2025-05-07T20:26:29.5040931Z 2025-05-07T20:26:29.5040935Z 2025-05-07T20:26:29.5040938Z 2025-05-07T20:26:29.5040942Z 2025-05-07T20:26:29.5040952Z 2025-05-07T20:26:29.5040956Z 2025-05-07T20:26:29.5098820Z libcurand-10.3.7.77 | 39.9 MB | ########## | 100%  2025-05-07T20:26:29.5099152Z 2025-05-07T20:26:29.5099157Z 2025-05-07T20:26:29.5099165Z 2025-05-07T20:26:29.5099169Z 2025-05-07T20:26:29.5099172Z 2025-05-07T20:26:29.5099176Z 2025-05-07T20:26:29.5099180Z 2025-05-07T20:26:29.5099185Z 2025-05-07T20:26:29.5099188Z 2025-05-07T20:26:29.5100972Z 2025-05-07T20:26:29.5531958Z gds-tools-1.11.1.6 | 37.8 MB | ########2 | 83%  2025-05-07T20:26:29.5532336Z 2025-05-07T20:26:29.5532340Z 2025-05-07T20:26:29.5532344Z 2025-05-07T20:26:29.5532347Z 2025-05-07T20:26:29.5532351Z 2025-05-07T20:26:29.5532354Z 2025-05-07T20:26:29.5532358Z 2025-05-07T20:26:29.5532362Z 2025-05-07T20:26:29.5532365Z 2025-05-07T20:26:29.5532369Z 2025-05-07T20:26:29.5532384Z 2025-05-07T20:26:29.5532388Z 2025-05-07T20:26:29.5532391Z 2025-05-07T20:26:29.5634706Z libnvjitlink-12.6.85 | 14.9 MB | | 0%  2025-05-07T20:26:29.5635048Z 2025-05-07T20:26:29.5635083Z 2025-05-07T20:26:29.5635087Z 2025-05-07T20:26:29.5635090Z 2025-05-07T20:26:29.5635093Z 2025-05-07T20:26:29.5635097Z 2025-05-07T20:26:29.5635100Z 2025-05-07T20:26:29.5635104Z 2025-05-07T20:26:29.5635107Z 2025-05-07T20:26:29.5635111Z 2025-05-07T20:26:29.5635114Z 2025-05-07T20:26:29.5703139Z cuda-nvcc-tools-12.6 | 23.0 MB | ###6 | 36%  2025-05-07T20:26:29.5703476Z 2025-05-07T20:26:29.5703713Z 2025-05-07T20:26:29.5703718Z 2025-05-07T20:26:29.5703721Z 2025-05-07T20:26:29.5703725Z 2025-05-07T20:26:29.5703733Z 2025-05-07T20:26:29.5703739Z 2025-05-07T20:26:29.5703744Z 2025-05-07T20:26:29.5703747Z 2025-05-07T20:26:29.5703751Z 2025-05-07T20:26:29.5703754Z 2025-05-07T20:26:29.5711597Z 2025-05-07T20:26:29.6251607Z cuda-nvrtc-12.6.85 | 17.3 MB | ##### | 50%  2025-05-07T20:26:29.6251935Z 2025-05-07T20:26:29.6251939Z 
2025-05-07T20:26:29.6251942Z 2025-05-07T20:26:29.6251946Z 2025-05-07T20:26:29.6251951Z 2025-05-07T20:26:29.6252211Z 2025-05-07T20:26:29.6252215Z 2025-05-07T20:26:29.6252218Z 2025-05-07T20:26:29.6252223Z 2025-05-07T20:26:29.6252237Z 2025-05-07T20:26:29.6528762Z gds-tools-1.11.1.6 | 37.8 MB | #########1 | 91%  2025-05-07T20:26:29.6529147Z 2025-05-07T20:26:29.6529152Z 2025-05-07T20:26:29.6529156Z 2025-05-07T20:26:29.6529169Z 2025-05-07T20:26:29.6529173Z 2025-05-07T20:26:29.6529177Z 2025-05-07T20:26:29.6529194Z 2025-05-07T20:26:29.6529198Z 2025-05-07T20:26:29.6529201Z 2025-05-07T20:26:29.6529205Z 2025-05-07T20:26:29.6529208Z 2025-05-07T20:26:29.6529212Z 2025-05-07T20:26:29.6529215Z 2025-05-07T20:26:29.6805178Z libnvjitlink-12.6.85 | 14.9 MB | #7 | 18%  2025-05-07T20:26:29.6805520Z 2025-05-07T20:26:29.6805524Z 2025-05-07T20:26:29.6805528Z 2025-05-07T20:26:29.6805531Z 2025-05-07T20:26:29.6805535Z 2025-05-07T20:26:29.6805538Z 2025-05-07T20:26:29.6805542Z 2025-05-07T20:26:29.6805545Z 2025-05-07T20:26:29.6805568Z 2025-05-07T20:26:29.6805571Z 2025-05-07T20:26:29.6809094Z 2025-05-07T20:26:29.6840705Z cuda-nvcc-tools-12.6 | 23.0 MB | ####8 | 48%  2025-05-07T20:26:29.6841354Z 2025-05-07T20:26:29.6841358Z 2025-05-07T20:26:29.6841361Z 2025-05-07T20:26:29.6841365Z 2025-05-07T20:26:29.6841368Z 2025-05-07T20:26:29.6841371Z 2025-05-07T20:26:29.6841375Z 2025-05-07T20:26:29.6841378Z 2025-05-07T20:26:29.6841392Z 2025-05-07T20:26:29.6841396Z 2025-05-07T20:26:29.6841451Z 2025-05-07T20:26:29.6841455Z 2025-05-07T20:26:29.7430736Z cuda-nvrtc-12.6.85 | 17.3 MB | ######6 | 66%  2025-05-07T20:26:29.7431171Z 2025-05-07T20:26:29.7431177Z 2025-05-07T20:26:29.7431182Z 2025-05-07T20:26:29.7431187Z 2025-05-07T20:26:29.7431192Z 2025-05-07T20:26:29.7431197Z 2025-05-07T20:26:29.7431204Z 2025-05-07T20:26:29.7431209Z 2025-05-07T20:26:29.7431215Z 2025-05-07T20:26:29.7431221Z 2025-05-07T20:26:29.7566700Z gds-tools-1.11.1.6 | 37.8 MB | #########9 | 100%  2025-05-07T20:26:29.7567192Z 2025-05-07T20:26:29.7567198Z 2025-05-07T20:26:29.7567203Z 2025-05-07T20:26:29.7567209Z 2025-05-07T20:26:29.7567215Z 2025-05-07T20:26:29.7567220Z 2025-05-07T20:26:29.7567227Z 2025-05-07T20:26:29.7567233Z 2025-05-07T20:26:29.7567238Z 2025-05-07T20:26:29.7567245Z 2025-05-07T20:26:29.7567251Z 2025-05-07T20:26:29.7567257Z 2025-05-07T20:26:29.7567279Z 2025-05-07T20:26:29.7893095Z libnvjitlink-12.6.85 | 14.9 MB | ###5 | 36%  2025-05-07T20:26:29.7893435Z 2025-05-07T20:26:29.7893439Z 2025-05-07T20:26:29.7893442Z 2025-05-07T20:26:29.7893446Z 2025-05-07T20:26:29.7893449Z 2025-05-07T20:26:29.7893460Z 2025-05-07T20:26:29.7893464Z 2025-05-07T20:26:29.7893467Z 2025-05-07T20:26:29.7893471Z 2025-05-07T20:26:29.7893474Z 2025-05-07T20:26:29.7893478Z 2025-05-07T20:26:29.7896960Z 2025-05-07T20:26:29.7979760Z cuda-nvrtc-12.6.85 | 17.3 MB | ######## | 81%  2025-05-07T20:26:29.7980252Z 2025-05-07T20:26:29.7980258Z 2025-05-07T20:26:29.7980264Z 2025-05-07T20:26:29.7980269Z 2025-05-07T20:26:29.7980274Z 2025-05-07T20:26:29.7980280Z 2025-05-07T20:26:29.7980285Z 2025-05-07T20:26:29.7980290Z 2025-05-07T20:26:29.7980295Z 2025-05-07T20:26:29.7980300Z 2025-05-07T20:26:29.7983618Z 2025-05-07T20:26:29.8569057Z cuda-nvcc-tools-12.6 | 23.0 MB | #####9 | 60%  2025-05-07T20:26:29.8569405Z 2025-05-07T20:26:29.8569409Z 2025-05-07T20:26:29.8569413Z 2025-05-07T20:26:29.8569416Z 2025-05-07T20:26:29.8569430Z 2025-05-07T20:26:29.8569434Z 2025-05-07T20:26:29.8569438Z 2025-05-07T20:26:29.8569441Z 2025-05-07T20:26:29.8569445Z 2025-05-07T20:26:29.8569448Z 2025-05-07T20:26:29.8569452Z 
2025-05-07T20:26:29.8569455Z 2025-05-07T20:26:29.8574110Z 2025-05-07T20:26:29.8900104Z libnvjitlink-12.6.85 | 14.9 MB | #####4 | 55%  2025-05-07T20:26:29.8900444Z 2025-05-07T20:26:29.8900708Z 2025-05-07T20:26:29.8900711Z 2025-05-07T20:26:29.8900715Z 2025-05-07T20:26:29.8900719Z 2025-05-07T20:26:29.8900722Z 2025-05-07T20:26:29.8900725Z 2025-05-07T20:26:29.8900729Z 2025-05-07T20:26:29.8900732Z 2025-05-07T20:26:29.8900736Z 2025-05-07T20:26:29.8900739Z 2025-05-07T20:26:29.8903761Z 2025-05-07T20:26:29.8983730Z cuda-nvrtc-12.6.85 | 17.3 MB | #########6 | 97%  2025-05-07T20:26:29.8984062Z 2025-05-07T20:26:29.8984066Z 2025-05-07T20:26:29.8984070Z 2025-05-07T20:26:29.8984073Z 2025-05-07T20:26:29.8984077Z 2025-05-07T20:26:29.8984080Z 2025-05-07T20:26:29.8984084Z 2025-05-07T20:26:29.8984087Z 2025-05-07T20:26:29.8984091Z 2025-05-07T20:26:29.8984095Z 2025-05-07T20:26:29.8986007Z 2025-05-07T20:26:29.9573937Z cuda-nvcc-tools-12.6 | 23.0 MB | ####### | 71%  2025-05-07T20:26:29.9574434Z 2025-05-07T20:26:29.9574442Z 2025-05-07T20:26:29.9574448Z 2025-05-07T20:26:29.9574454Z 2025-05-07T20:26:29.9574482Z 2025-05-07T20:26:29.9574488Z 2025-05-07T20:26:29.9574493Z 2025-05-07T20:26:29.9574610Z 2025-05-07T20:26:29.9574618Z 2025-05-07T20:26:29.9574624Z 2025-05-07T20:26:29.9574631Z 2025-05-07T20:26:29.9574637Z 2025-05-07T20:26:29.9574656Z 2025-05-07T20:26:29.9995881Z libnvjitlink-12.6.85 | 14.9 MB | #######3 | 74%  2025-05-07T20:26:29.9996218Z 2025-05-07T20:26:29.9996229Z 2025-05-07T20:26:29.9996252Z 2025-05-07T20:26:29.9996264Z 2025-05-07T20:26:29.9996268Z 2025-05-07T20:26:29.9996272Z 2025-05-07T20:26:29.9996275Z 2025-05-07T20:26:29.9996279Z 2025-05-07T20:26:29.9996282Z 2025-05-07T20:26:29.9996286Z 2025-05-07T20:26:29.9999644Z 2025-05-07T20:26:30.0590654Z cuda-nvcc-tools-12.6 | 23.0 MB | ########2 | 82%  2025-05-07T20:26:30.0591114Z 2025-05-07T20:26:30.0591120Z 2025-05-07T20:26:30.0591126Z 2025-05-07T20:26:30.0591139Z 2025-05-07T20:26:30.0591145Z 2025-05-07T20:26:30.0591150Z 2025-05-07T20:26:30.0591182Z 2025-05-07T20:26:30.0591187Z 2025-05-07T20:26:30.0591193Z 2025-05-07T20:26:30.0591198Z 2025-05-07T20:26:30.0591204Z 2025-05-07T20:26:30.0591209Z 2025-05-07T20:26:30.0591220Z 2025-05-07T20:26:30.0713738Z libnvjitlink-12.6.85 | 14.9 MB | #########2 | 93%  2025-05-07T20:26:30.0714081Z 2025-05-07T20:26:30.0715192Z 2025-05-07T20:26:30.1007848Z libcufft-11.3.0.4 | 156.2 MB | ########## | 100%  2025-05-07T20:26:30.1008140Z 2025-05-07T20:26:30.1008144Z 2025-05-07T20:26:30.1008148Z 2025-05-07T20:26:30.1008151Z 2025-05-07T20:26:30.1008155Z 2025-05-07T20:26:30.1008159Z 2025-05-07T20:26:30.1008162Z 2025-05-07T20:26:30.1008166Z 2025-05-07T20:26:30.1008169Z 2025-05-07T20:26:30.1008173Z 2025-05-07T20:26:30.1008176Z 2025-05-07T20:26:30.5390397Z cuda-nvcc-tools-12.6 | 23.0 MB | #########3 | 93%  2025-05-07T20:26:30.5390744Z 2025-05-07T20:26:30.5390748Z 2025-05-07T20:26:30.5390752Z 2025-05-07T20:26:30.5390755Z 2025-05-07T20:26:30.5390800Z 2025-05-07T20:26:30.5390803Z 2025-05-07T20:26:30.5390807Z 2025-05-07T20:26:30.5390810Z 2025-05-07T20:26:30.5390813Z 2025-05-07T20:26:30.5390827Z 2025-05-07T20:26:30.5390831Z 2025-05-07T20:26:30.5391832Z 2025-05-07T20:26:30.5528071Z cuda-nvrtc-12.6.85 | 17.3 MB | ########## | 100%  2025-05-07T20:26:30.5528490Z 2025-05-07T20:26:30.5528494Z 2025-05-07T20:26:30.5528773Z 2025-05-07T20:26:30.5528778Z 2025-05-07T20:26:30.5528781Z 2025-05-07T20:26:30.5528784Z 2025-05-07T20:26:30.5528788Z 2025-05-07T20:26:30.5528792Z 2025-05-07T20:26:30.5528795Z 2025-05-07T20:26:30.5528799Z 2025-05-07T20:26:30.5528803Z 
2025-05-07T20:26:30.5528806Z 2025-05-07T20:26:30.5530252Z 2025-05-07T20:26:30.5821485Z libnvjitlink-12.6.85 | 14.9 MB | ########## | 100%  2025-05-07T20:26:30.5821925Z 2025-05-07T20:26:30.5821929Z 2025-05-07T20:26:30.5821933Z 2025-05-07T20:26:30.5821936Z 2025-05-07T20:26:30.5821940Z 2025-05-07T20:26:30.5822212Z 2025-05-07T20:26:30.5822216Z 2025-05-07T20:26:30.5822219Z 2025-05-07T20:26:30.5822223Z 2025-05-07T20:26:30.5822226Z 2025-05-07T20:26:30.5822229Z 2025-05-07T20:26:30.5822233Z 2025-05-07T20:26:30.5822245Z 2025-05-07T20:26:30.5822249Z 2025-05-07T20:26:30.6061057Z cuda-nvcc-dev_linux- | 10.8 MB | | 0%  2025-05-07T20:26:30.6061480Z 2025-05-07T20:26:30.6061498Z 2025-05-07T20:26:30.6061510Z 2025-05-07T20:26:30.6061514Z 2025-05-07T20:26:30.6061517Z 2025-05-07T20:26:30.6061521Z 2025-05-07T20:26:30.6061524Z 2025-05-07T20:26:30.6061528Z 2025-05-07T20:26:30.6061531Z 2025-05-07T20:26:30.6061535Z 2025-05-07T20:26:30.6061538Z 2025-05-07T20:26:30.6061542Z 2025-05-07T20:26:30.6061545Z 2025-05-07T20:26:30.6061549Z 2025-05-07T20:26:30.6061552Z 2025-05-07T20:26:30.6824346Z cuda-nvvm-tools-12.6 | 10.4 MB | | 0%  2025-05-07T20:26:30.6824699Z 2025-05-07T20:26:30.6824721Z 2025-05-07T20:26:30.6824724Z 2025-05-07T20:26:30.6824728Z 2025-05-07T20:26:30.6824731Z 2025-05-07T20:26:30.6824735Z 2025-05-07T20:26:30.6824738Z 2025-05-07T20:26:30.6824742Z 2025-05-07T20:26:30.6824745Z 2025-05-07T20:26:30.6824749Z 2025-05-07T20:26:30.6824752Z 2025-05-07T20:26:30.6824756Z 2025-05-07T20:26:30.6824759Z 2025-05-07T20:26:30.6826354Z 2025-05-07T20:26:30.7066683Z cuda-nvcc-dev_linux- | 10.8 MB | ###2 | 32%  2025-05-07T20:26:30.7067036Z 2025-05-07T20:26:30.7067040Z 2025-05-07T20:26:30.7067043Z 2025-05-07T20:26:30.7067055Z 2025-05-07T20:26:30.7067058Z 2025-05-07T20:26:30.7067062Z 2025-05-07T20:26:30.7067065Z 2025-05-07T20:26:30.7067068Z 2025-05-07T20:26:30.7067072Z 2025-05-07T20:26:30.7067075Z 2025-05-07T20:26:30.7067079Z 2025-05-07T20:26:30.7067082Z 2025-05-07T20:26:30.7067085Z 2025-05-07T20:26:30.7067089Z 2025-05-07T20:26:30.7067092Z 2025-05-07T20:26:30.7982875Z cuda-nvvm-tools-12.6 | 10.4 MB | ##5 | 25%  2025-05-07T20:26:30.7983246Z 2025-05-07T20:26:30.7983250Z 2025-05-07T20:26:30.7983254Z 2025-05-07T20:26:30.7983257Z 2025-05-07T20:26:30.7983261Z 2025-05-07T20:26:30.7983265Z 2025-05-07T20:26:30.7983269Z 2025-05-07T20:26:30.7983273Z 2025-05-07T20:26:30.7983276Z 2025-05-07T20:26:30.7983280Z 2025-05-07T20:26:30.7983283Z 2025-05-07T20:26:30.7983287Z 2025-05-07T20:26:30.7983290Z 2025-05-07T20:26:30.7986782Z 2025-05-07T20:26:30.8067759Z cuda-nvcc-dev_linux- | 10.8 MB | ######4 | 65%  2025-05-07T20:26:30.8068175Z 2025-05-07T20:26:30.8068179Z 2025-05-07T20:26:30.8068183Z 2025-05-07T20:26:30.8068186Z 2025-05-07T20:26:30.8068190Z 2025-05-07T20:26:30.8068194Z 2025-05-07T20:26:30.8068197Z 2025-05-07T20:26:30.8068201Z 2025-05-07T20:26:30.8068204Z 2025-05-07T20:26:30.8068208Z 2025-05-07T20:26:30.8068211Z 2025-05-07T20:26:30.8068215Z 2025-05-07T20:26:30.8068225Z 2025-05-07T20:26:30.8068229Z 2025-05-07T20:26:30.8068241Z 2025-05-07T20:26:30.8815230Z cuda-nvvm-tools-12.6 | 10.4 MB | ##### | 51%  2025-05-07T20:26:30.8815661Z 2025-05-07T20:26:30.8815677Z 2025-05-07T20:26:30.8815682Z 2025-05-07T20:26:30.8815687Z 2025-05-07T20:26:30.8815692Z 2025-05-07T20:26:30.8815697Z 2025-05-07T20:26:30.8815702Z 2025-05-07T20:26:30.8815707Z 2025-05-07T20:26:30.8815713Z 2025-05-07T20:26:30.8815947Z 2025-05-07T20:26:30.8819137Z 2025-05-07T20:26:30.8996211Z cuda-nvcc-tools-12.6 | 23.0 MB | ########## | 100%  2025-05-07T20:26:30.8996857Z 2025-05-07T20:26:30.8996866Z 
2025-05-07T20:26:30.8996873Z 2025-05-07T20:26:30.8996880Z 2025-05-07T20:26:30.8996887Z 2025-05-07T20:26:30.8996894Z 2025-05-07T20:26:30.8996901Z 2025-05-07T20:26:30.8996908Z 2025-05-07T20:26:30.8996915Z 2025-05-07T20:26:30.8996921Z 2025-05-07T20:26:30.8996928Z 2025-05-07T20:26:30.8996935Z 2025-05-07T20:26:30.8996942Z 2025-05-07T20:26:30.8996949Z 2025-05-07T20:26:30.9070755Z cuda-nvcc-dev_linux- | 10.8 MB | #########4 | 95%  2025-05-07T20:26:30.9071187Z 2025-05-07T20:26:30.9071191Z 2025-05-07T20:26:30.9071194Z 2025-05-07T20:26:30.9071198Z 2025-05-07T20:26:30.9071201Z 2025-05-07T20:26:30.9071204Z 2025-05-07T20:26:30.9071208Z 2025-05-07T20:26:30.9071219Z 2025-05-07T20:26:30.9071222Z 2025-05-07T20:26:30.9071226Z 2025-05-07T20:26:30.9071238Z 2025-05-07T20:26:30.9071242Z 2025-05-07T20:26:30.9071245Z 2025-05-07T20:26:30.9071248Z 2025-05-07T20:26:30.9071252Z 2025-05-07T20:26:30.9419186Z cuda-nvvm-tools-12.6 | 10.4 MB | ########2 | 83%  2025-05-07T20:26:30.9419546Z 2025-05-07T20:26:30.9419550Z 2025-05-07T20:26:30.9419553Z 2025-05-07T20:26:30.9419557Z 2025-05-07T20:26:30.9419561Z 2025-05-07T20:26:30.9419564Z 2025-05-07T20:26:30.9419568Z 2025-05-07T20:26:30.9419571Z 2025-05-07T20:26:30.9419575Z 2025-05-07T20:26:30.9419578Z 2025-05-07T20:26:30.9419582Z 2025-05-07T20:26:30.9419598Z 2025-05-07T20:26:30.9419602Z 2025-05-07T20:26:30.9419605Z 2025-05-07T20:26:30.9419616Z 2025-05-07T20:26:30.9422009Z 2025-05-07T20:26:30.9754567Z cuda-sanitizer-api-1 | 8.9 MB | | 0%  2025-05-07T20:26:30.9754921Z 2025-05-07T20:26:30.9754932Z 2025-05-07T20:26:30.9754936Z 2025-05-07T20:26:30.9754940Z 2025-05-07T20:26:30.9754943Z 2025-05-07T20:26:30.9754962Z 2025-05-07T20:26:30.9754966Z 2025-05-07T20:26:30.9754969Z 2025-05-07T20:26:30.9754973Z 2025-05-07T20:26:30.9754976Z 2025-05-07T20:26:31.0318398Z gds-tools-1.11.1.6 | 37.8 MB | ########## | 100%  2025-05-07T20:26:31.0318804Z 2025-05-07T20:26:31.0318808Z 2025-05-07T20:26:31.0318812Z 2025-05-07T20:26:31.0318815Z 2025-05-07T20:26:31.0318819Z 2025-05-07T20:26:31.0318823Z 2025-05-07T20:26:31.0318826Z 2025-05-07T20:26:31.0318830Z 2025-05-07T20:26:31.0318833Z 2025-05-07T20:26:31.0318837Z 2025-05-07T20:26:31.0318840Z 2025-05-07T20:26:31.0318865Z 2025-05-07T20:26:31.0318868Z 2025-05-07T20:26:31.0318872Z 2025-05-07T20:26:31.0318875Z 2025-05-07T20:26:31.0318879Z 2025-05-07T20:26:31.0320195Z 2025-05-07T20:26:31.0422008Z cuda-nvvm-impl-12.6. | 7.7 MB | | 0%  2025-05-07T20:26:31.0422361Z 2025-05-07T20:26:31.0422365Z 2025-05-07T20:26:31.0422368Z 2025-05-07T20:26:31.0422373Z 2025-05-07T20:26:31.0422388Z 2025-05-07T20:26:31.0422392Z 2025-05-07T20:26:31.0422395Z 2025-05-07T20:26:31.0422399Z 2025-05-07T20:26:31.0422402Z 2025-05-07T20:26:31.0422405Z 2025-05-07T20:26:31.0422409Z 2025-05-07T20:26:31.0422412Z 2025-05-07T20:26:31.0422416Z 2025-05-07T20:26:31.0422419Z 2025-05-07T20:26:31.0422423Z 2025-05-07T20:26:31.0423926Z 2025-05-07T20:26:31.1321893Z cuda-sanitizer-api-1 | 8.9 MB | ###9 | 39%  2025-05-07T20:26:31.1322261Z 2025-05-07T20:26:31.1322265Z 2025-05-07T20:26:31.1322268Z 2025-05-07T20:26:31.1322307Z 2025-05-07T20:26:31.1322311Z 2025-05-07T20:26:31.1322314Z 2025-05-07T20:26:31.1322327Z 2025-05-07T20:26:31.1322331Z 2025-05-07T20:26:31.1322335Z 2025-05-07T20:26:31.1322339Z 2025-05-07T20:26:31.1322343Z 2025-05-07T20:26:31.1322347Z 2025-05-07T20:26:31.1322350Z 2025-05-07T20:26:31.1322354Z 2025-05-07T20:26:31.1322357Z 2025-05-07T20:26:31.1322361Z 2025-05-07T20:26:31.1325208Z 2025-05-07T20:26:31.1732275Z cuda-nvvm-impl-12.6. 
| 7.7 MB | ###6 | 36%  2025-05-07T20:26:31.1732664Z 2025-05-07T20:26:31.1732668Z 2025-05-07T20:26:31.1732671Z 2025-05-07T20:26:31.1732675Z 2025-05-07T20:26:31.1732678Z 2025-05-07T20:26:31.1732682Z 2025-05-07T20:26:31.1732685Z 2025-05-07T20:26:31.1732689Z 2025-05-07T20:26:31.1732692Z 2025-05-07T20:26:31.1732695Z 2025-05-07T20:26:31.1732699Z 2025-05-07T20:26:31.1732702Z 2025-05-07T20:26:31.1732706Z 2025-05-07T20:26:31.1732709Z 2025-05-07T20:26:31.1732713Z 2025-05-07T20:26:31.1734414Z 2025-05-07T20:26:31.2328281Z cuda-sanitizer-api-1 | 8.9 MB | #######8 | 78%  2025-05-07T20:26:31.2328643Z 2025-05-07T20:26:31.2328658Z 2025-05-07T20:26:31.2328662Z 2025-05-07T20:26:31.2328665Z 2025-05-07T20:26:31.2328669Z 2025-05-07T20:26:31.2328672Z 2025-05-07T20:26:31.2328676Z 2025-05-07T20:26:31.2328679Z 2025-05-07T20:26:31.2328683Z 2025-05-07T20:26:31.2328687Z 2025-05-07T20:26:31.2328718Z 2025-05-07T20:26:31.2328722Z 2025-05-07T20:26:31.2328725Z 2025-05-07T20:26:31.2328729Z 2025-05-07T20:26:31.2328732Z 2025-05-07T20:26:31.2328736Z 2025-05-07T20:26:31.2328739Z 2025-05-07T20:26:31.2859955Z cuda-nvvm-impl-12.6. | 7.7 MB | #######4 | 75%  2025-05-07T20:26:31.2860306Z 2025-05-07T20:26:31.2860310Z 2025-05-07T20:26:31.2860313Z 2025-05-07T20:26:31.2860317Z 2025-05-07T20:26:31.2860320Z 2025-05-07T20:26:31.2860324Z 2025-05-07T20:26:31.2860328Z 2025-05-07T20:26:31.2860331Z 2025-05-07T20:26:31.2860358Z 2025-05-07T20:26:31.2860361Z 2025-05-07T20:26:31.2860365Z 2025-05-07T20:26:31.2860368Z 2025-05-07T20:26:31.2860372Z 2025-05-07T20:26:31.2860375Z 2025-05-07T20:26:31.3249666Z cuda-nvcc-dev_linux- | 10.8 MB | ########## | 100%  2025-05-07T20:26:31.3249991Z 2025-05-07T20:26:31.3249994Z 2025-05-07T20:26:31.3249998Z 2025-05-07T20:26:31.3250002Z 2025-05-07T20:26:31.3250020Z 2025-05-07T20:26:31.3250024Z 2025-05-07T20:26:31.3250028Z 2025-05-07T20:26:31.3250031Z 2025-05-07T20:26:31.3250035Z 2025-05-07T20:26:31.3250038Z 2025-05-07T20:26:31.3250042Z 2025-05-07T20:26:31.3250052Z 2025-05-07T20:26:31.3250056Z 2025-05-07T20:26:31.3250059Z 2025-05-07T20:26:31.3250063Z 2025-05-07T20:26:31.3250066Z 2025-05-07T20:26:31.3250070Z 2025-05-07T20:26:31.3251398Z 2025-05-07T20:26:31.3472667Z libglib-2.84.0 | 3.8 MB | | 0%  2025-05-07T20:26:31.3473049Z 2025-05-07T20:26:31.3473070Z 2025-05-07T20:26:31.3473074Z 2025-05-07T20:26:31.3473078Z 2025-05-07T20:26:31.3473081Z 2025-05-07T20:26:31.3473085Z 2025-05-07T20:26:31.3473089Z 2025-05-07T20:26:31.3473092Z 2025-05-07T20:26:31.3473096Z 2025-05-07T20:26:31.3473099Z 2025-05-07T20:26:31.3473103Z 2025-05-07T20:26:31.3473106Z 2025-05-07T20:26:31.3473110Z 2025-05-07T20:26:31.3473113Z 2025-05-07T20:26:31.3478098Z 2025-05-07T20:26:31.3879957Z cuda-nvvm-tools-12.6 | 10.4 MB | ########## | 100%  2025-05-07T20:26:31.3880335Z 2025-05-07T20:26:31.3880339Z 2025-05-07T20:26:31.3880343Z 2025-05-07T20:26:31.3880346Z 2025-05-07T20:26:31.3880350Z 2025-05-07T20:26:31.3880353Z 2025-05-07T20:26:31.3880357Z 2025-05-07T20:26:31.3880369Z 2025-05-07T20:26:31.3880373Z 2025-05-07T20:26:31.3880376Z 2025-05-07T20:26:31.3880380Z 2025-05-07T20:26:31.3880384Z 2025-05-07T20:26:31.3880387Z 2025-05-07T20:26:31.3880391Z 2025-05-07T20:26:31.3880394Z 2025-05-07T20:26:31.3880398Z 2025-05-07T20:26:31.3880408Z 2025-05-07T20:26:31.3880412Z 2025-05-07T20:26:31.3882964Z 2025-05-07T20:26:31.4257310Z ... (more hidden) ... 
2025-05-07T20:26:31.4257628Z 2025-05-07T20:26:31.4257632Z 2025-05-07T20:26:31.4257635Z 2025-05-07T20:26:31.4257639Z 2025-05-07T20:26:31.4257651Z 2025-05-07T20:26:31.4257655Z 2025-05-07T20:26:31.4257658Z 2025-05-07T20:26:31.4257662Z 2025-05-07T20:26:31.4257886Z 2025-05-07T20:26:31.4257891Z 2025-05-07T20:26:31.4257894Z 2025-05-07T20:26:31.4257898Z 2025-05-07T20:26:31.4257901Z 2025-05-07T20:26:31.4257905Z 2025-05-07T20:26:31.4257908Z 2025-05-07T20:26:31.4257912Z 2025-05-07T20:26:31.4257915Z 2025-05-07T20:26:31.4257918Z 2025-05-07T20:26:31.4884808Z libglib-2.84.0 | 3.8 MB | ########5 | 85%  2025-05-07T20:26:31.4885149Z 2025-05-07T20:26:31.4885153Z 2025-05-07T20:26:31.4885157Z 2025-05-07T20:26:31.4885160Z 2025-05-07T20:26:31.4885165Z 2025-05-07T20:26:31.4885429Z 2025-05-07T20:26:31.4885433Z 2025-05-07T20:26:31.4885436Z 2025-05-07T20:26:31.4885440Z 2025-05-07T20:26:31.4885443Z 2025-05-07T20:26:31.4885447Z 2025-05-07T20:26:31.4885450Z 2025-05-07T20:26:31.4885454Z 2025-05-07T20:26:31.4885457Z 2025-05-07T20:26:31.4885468Z 2025-05-07T20:26:31.4885475Z 2025-05-07T20:26:31.4885480Z 2025-05-07T20:26:31.4885485Z 2025-05-07T20:26:31.4885704Z 2025-05-07T20:26:31.5205095Z ... (more hidden) ... 2025-05-07T20:26:31.5205405Z 2025-05-07T20:26:31.5205409Z 2025-05-07T20:26:31.5205413Z 2025-05-07T20:26:31.5205417Z 2025-05-07T20:26:31.5205420Z 2025-05-07T20:26:31.5205424Z 2025-05-07T20:26:31.5205428Z 2025-05-07T20:26:31.5205431Z 2025-05-07T20:26:31.5205435Z 2025-05-07T20:26:31.5205438Z 2025-05-07T20:26:31.5205442Z 2025-05-07T20:26:31.5205445Z 2025-05-07T20:26:31.5205449Z 2025-05-07T20:26:31.5205452Z 2025-05-07T20:26:31.5205456Z 2025-05-07T20:26:31.5206850Z 2025-05-07T20:26:31.5342128Z cuda-sanitizer-api-1 | 8.9 MB | ########## | 100%  2025-05-07T20:26:31.5342503Z 2025-05-07T20:26:31.5342507Z 2025-05-07T20:26:31.5342510Z 2025-05-07T20:26:31.5342514Z 2025-05-07T20:26:31.5342517Z 2025-05-07T20:26:31.5342521Z 2025-05-07T20:26:31.5342524Z 2025-05-07T20:26:31.5342528Z 2025-05-07T20:26:31.5342539Z 2025-05-07T20:26:31.5342543Z 2025-05-07T20:26:31.5342546Z 2025-05-07T20:26:31.5342560Z 2025-05-07T20:26:31.5342566Z 2025-05-07T20:26:31.5342571Z 2025-05-07T20:26:31.5342576Z 2025-05-07T20:26:31.5342579Z 2025-05-07T20:26:31.5342582Z 2025-05-07T20:26:31.5636116Z cuda-nvvm-impl-12.6. | 7.7 MB | ########## | 100%  2025-05-07T20:26:31.5636698Z 2025-05-07T20:26:31.5636704Z 2025-05-07T20:26:31.5636710Z 2025-05-07T20:26:31.5636716Z 2025-05-07T20:26:31.5636722Z 2025-05-07T20:26:31.5636728Z 2025-05-07T20:26:31.5636734Z 2025-05-07T20:26:31.5636740Z 2025-05-07T20:26:31.5636746Z 2025-05-07T20:26:31.5636770Z 2025-05-07T20:26:31.5636776Z 2025-05-07T20:26:31.5636782Z 2025-05-07T20:26:31.5636787Z 2025-05-07T20:26:31.5636793Z 2025-05-07T20:26:31.5636799Z 2025-05-07T20:26:31.5636804Z 2025-05-07T20:26:31.5636810Z 2025-05-07T20:26:31.5636816Z 2025-05-07T20:26:31.6042546Z libglib-2.84.0 | 3.8 MB | ########## | 100%  2025-05-07T20:26:31.6042889Z 2025-05-07T20:26:31.6042914Z 2025-05-07T20:26:31.6042920Z 2025-05-07T20:26:31.6042935Z 2025-05-07T20:26:31.6042940Z 2025-05-07T20:26:31.6042945Z 2025-05-07T20:26:31.6042951Z 2025-05-07T20:26:31.6042956Z 2025-05-07T20:26:31.6042960Z 2025-05-07T20:26:31.6042965Z 2025-05-07T20:26:31.6042970Z 2025-05-07T20:26:31.6042975Z 2025-05-07T20:26:31.6042980Z 2025-05-07T20:26:31.6042985Z 2025-05-07T20:26:31.6042990Z 2025-05-07T20:26:31.6042995Z 2025-05-07T20:26:31.6043001Z 2025-05-07T20:26:31.6043006Z 2025-05-07T20:26:31.6047246Z 2025-05-07T20:26:32.5933111Z ... (more hidden) ... 
2025-05-07T20:26:33.7368004Z libcusolver-11.7.1.2 | 95.8 MB | ########## | 100%
2025-05-07T20:26:34.0716038Z cuda-nvvp-12.6.80 | 109.3 MB | ########## | 100%
2025-05-07T20:26:34.3072634Z cuda-nvdisasm-12.6.7 | 47.6 MB | ########## | 100%
2025-05-07T20:26:34.5741789Z libnpp-12.3.1.54 | 93.4 MB | ########## | 100%
2025-05-07T20:26:34.6238356Z nsight-compute-2024. | 443.1 MB | ########## | 100%
2025-05-07T20:26:34.6836924Z libcurand-10.3.7.77 | 39.9 MB | ########## | 100%
2025-05-07T20:26:34.8474057Z cuda-nvrtc-12.6.85 | 17.3 MB | ########## | 100%
2025-05-07T20:26:35.0581122Z libnvjitlink-12.6.85 | 14.9 MB | ########## | 100%
2025-05-07T20:26:35.0655418Z cuda-nvcc-tools-12.6 | 23.0 MB | ########## | 100%
2025-05-07T20:26:35.2402937Z gds-tools-1.11.1.6 | 37.8 MB | ########## | 100%
2025-05-07T20:26:35.2964490Z cuda-nvvm-tools-12.6 | 10.4 MB | ########## | 100%
2025-05-07T20:26:35.4426468Z cuda-nvcc-dev_linux- | 10.8 MB | ########## | 100%
2025-05-07T20:26:35.4525876Z cuda-sanitizer-api-1 | 8.9 MB | ########## | 100%
2025-05-07T20:26:35.6053672Z cuda-nvvm-impl-12.6. | 7.7 MB | ########## | 100%
2025-05-07T20:26:35.6509515Z ... (more hidden) ...
2025-05-07T20:26:37.0222939Z libglib-2.84.0 | 3.8 MB | ########## | 100%
2025-05-07T20:26:41.7301029Z libcublas-12.6.4.1 | 256.2 MB | ########## | 100%
2025-05-07T20:26:41.7309284Z nsight-compute-2024. | 443.1 MB | ########## | 100%
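(Aside: the package set above is what a pinned CUDA toolkit pull brings into the conda env. A minimal sketch of reproducing the install outside this workflow, assuming the nvidia channel label; the actual install command and channel are not shown in this log:)

# hypothetical equivalent of the install step (channel/label assumed, not from this log)
conda install -n build_binary -y -c "nvidia/label/cuda-12.6.3" cuda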
2025-05-07T20:26:41.9407674Z Preparing transaction: done
2025-05-07T20:26:43.1433549Z Verifying transaction: done
2025-05-07T20:26:43.6504492Z Executing transaction: done
2025-05-07T20:26:45.8563443Z [INSTALL] Fixing file placements for CUDA 12.6.3+ ...
2025-05-07T20:26:45.8564247Z [INSTALL] Creating symlinks: libnvToolsExt.so
2025-05-07T20:26:45.8565673Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so.1 /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so
2025-05-07T20:26:45.8580211Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so.1 /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so
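(Aside: recent CUDA conda packages appear to ship only the versioned libnvToolsExt.so.1, while parts of the build still link against the unversioned name, hence the two symlinks above. A minimal verification sketch, assuming the same env prefix as in the log:)

# confirm both symlinks resolve to the versioned library (paths taken from the log)
for d in lib targets/x86_64-linux/lib; do
  readlink -f "/home/ec2-user/miniconda/envs/build_binary/${d}/libnvToolsExt.so"
done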
2025-05-07T20:26:45.8593216Z [INSTALL] Copying nvtx3 headers ...
2025-05-07T20:26:45.8598810Z + cp -r /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCuda.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCudaRt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtOpenCL.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtSync.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtx3.hpp /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtxDetail /home/ec2-user/miniconda/envs/build_binary/include/
2025-05-07T20:26:45.8808882Z + cp -r /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCuda.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCudaRt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtOpenCL.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtSync.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtx3.hpp /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2024.3.2/host/target-linux-x64/nvtx/include/nvtx3/nvtxDetail /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/
2025-05-07T20:26:45.8831994Z [INSTALL] Appending libcuda.so path to LD_LIBRARY_PATH ...
2025-05-07T20:26:45.9212011Z [ENV] Appending to LD_LIBRARY_PATH: /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs ...
2025-05-07T20:26:47.8150202Z ERROR conda.cli.main_run:execute(125): `conda run printenv LD_LIBRARY_PATH` failed. (See above for error)
2025-05-07T20:26:47.8792606Z + conda env config vars set -n build_binary LD_LIBRARY_PATH=/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs
2025-05-07T20:26:48.3112744Z [INSTALL] Setting environment variable NVML_LIB_PATH ...
2025-05-07T20:26:48.3467051Z + conda env config vars set -n build_binary NVML_LIB_PATH=/home/ec2-user/miniconda/envs/build_binary/lib/stubs/libnvidia-ml.so
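(Aside: the `printenv LD_LIBRARY_PATH` ERROR above is most likely the expected miss, the variable had not been set yet, so the script falls through to setting it. `conda env config vars set` records the value in the environment itself and applies it on the next activation. A minimal sketch to confirm, using the same env name as the log:)

# list the vars recorded for the env, then read one back through a fresh activation
conda env config vars list -n build_binary
conda run -n build_binary printenv NVML_LIB_PATH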
2025-05-07T20:26:48.7831691Z [INSTALL] Setting environment variable CUDA_INCLUDE_DIRS ...
2025-05-07T20:26:48.7832891Z + conda env config vars set -n build_binary CUDA_INCLUDE_DIRS="/home/ec2-user/miniconda/envs/build_binary/include/:/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/"
2025-05-07T20:26:51.2615225Z [CHECK] cuda_runtime.h found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/cuda_runtime.h
2025-05-07T20:26:53.3195405Z [CHECK] libcuda.so found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs/libcuda.so
2025-05-07T20:26:55.3649056Z [CHECK] libnvToolsExt.so found in CONDA_PREFIX PATH (symbolic link): /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so
2025-05-07T20:26:55.3650317Z /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so
2025-05-07T20:26:57.4121204Z [CHECK] libnvidia-ml.so found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs/libnvidia-ml.so
2025-05-07T20:26:59.3206141Z /home/ec2-user/miniconda/envs/build_binary/bin/nvcc
2025-05-07T20:26:59.3895327Z [CHECK] Binary nvcc found in PATH
2025-05-07T20:27:03.2726378Z /tmp/tmpkqgbq979: line 3: clang: command not found
2025-05-07T20:27:03.2727725Z ERROR conda.cli.main_run:execute(125): `conda run clang --version` failed. (See above for error)
2025-05-07T20:27:03.3413214Z + ls -la /home/ec2-user/miniconda/envs/build_binary/etc/conda/activate.d
2025-05-07T20:27:03.3438063Z total 36
2025-05-07T20:27:03.3438463Z drwxr-xr-x. 2 ec2-user ec2-user   191 May  7 20:26 .
2025-05-07T20:27:03.3439005Z drwxr-xr-x. 5 ec2-user ec2-user    62 May  7 20:25 ..
2025-05-07T20:27:03.3439573Z -rw-r--r--. 2 ec2-user ec2-user  3778 Jun 10  2024 activate-binutils_linux-64.sh
2025-05-07T20:27:03.3440210Z -rw-r--r--. 2 ec2-user ec2-user 11630 Jun 10  2024 activate-gcc_linux-64.sh
2025-05-07T20:27:03.3440920Z -rw-r--r--. 2 ec2-user ec2-user  5190 Jun 10  2024 activate-gxx_linux-64.sh
2025-05-07T20:27:03.3441586Z -rw-r--r--. 2 ec2-user ec2-user   136 Mar 27 01:27 libglib_activate.sh
2025-05-07T20:27:03.3442221Z -rw-r--r--. 2 ec2-user ec2-user   872 Nov 13 09:20 libxml2_activate.sh
2025-05-07T20:27:03.3442698Z -rw-r--r--. 2 ec2-user ec2-user  2932 Nov 20 20:32 ~cuda-nvcc_activate.sh
2025-05-07T20:27:03.3443223Z [INSTALL] Removing the -ccbin=CXX hook from NVCC activation scripts ...
2025-05-07T20:27:03.3443871Z + sed -i /-ccbin=/d /home/ec2-user/miniconda/envs/build_binary/etc/conda/activate.d/*cuda-nvcc_activate.sh
2025-05-07T20:27:03.3464351Z + conda run -n build_binary c++ --version | grep -i clang
2025-05-07T20:27:05.3228593Z [BUILD] Setting prepend flags for NVCC ...
2025-05-07T20:27:05.3229342Z + conda env config vars set -n build_binary NVCC_PREPEND_FLAGS="-allow-unsupported-compiler"
2025-05-07T20:27:05.7521299Z + conda run -n build_binary printenv NVCC_PREPEND_FLAGS
2025-05-07T20:27:07.6502595Z -allow-unsupported-compiler
2025-05-07T20:27:07.7200898Z [INFO] Printing out all preprocessor defines in nvcc ...
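(Aside on the compiler hook fix above, before the define dump that follows: clang is not installed in the env, so the `clang --version` probe fails, and the `c++ --version | grep -i clang` check produced no match here. The `~cuda-nvcc_activate.sh` hook would otherwise pin nvcc to the conda compiler through a `-ccbin=` line; deleting that line and prepending `-allow-unsupported-compiler` lets nvcc accept whatever host compiler the build selects. A minimal sketch of the same edit on a hypothetical hook file; the actual hook contents are not shown in this log:)

# hypothetical activation hook before the fix (illustrative, not from the log)
echo 'export NVCC_PREPEND_FLAGS="${NVCC_PREPEND_FLAGS} -ccbin=${CXX}"' > demo_activate.sh
# same fix as above: drop any -ccbin= line so nvcc stops pinning the host compiler
sed -i '/-ccbin=/d' demo_activate.sh
cat demo_activate.sh  # now empty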
2025-05-07T20:27:07.7201617Z + conda run -n build_binary nvcc --compiler-options -dM -E -x cu - < /dev/null 2025-05-07T20:27:07.7201967Z 2025-05-07T20:27:09.6903579Z #define _GLIBCXX_DEPRECATED_SUGGEST(ALT) __attribute__ ((__deprecated__ ("use '" ALT "' instead"))) 2025-05-07T20:27:09.6904313Z #define M_PIl 3.141592653589793238462643383279502884L 2025-05-07T20:27:09.6904706Z #define _IO_CURRENTLY_PUTTING 0x800 2025-05-07T20:27:09.6905045Z #define __W_EXITCODE(ret,sig) ((ret) << 8 | (sig)) 2025-05-07T20:27:09.6905379Z #define __DBL_MIN_EXP__ (-1021) 2025-05-07T20:27:09.6905652Z #define _STL_PAIR_H 1 2025-05-07T20:27:09.6905913Z #define __cpp_attributes 200809L 2025-05-07T20:27:09.6906245Z #define __cpp_nontype_template_parameter_auto 201606L 2025-05-07T20:27:09.6906998Z #define __DELETE_THROW throw() 2025-05-07T20:27:09.6907286Z #define _PTRDIFF_T_ 2025-05-07T20:27:09.6907523Z #define M_PI_4 0.78539816339744830962 2025-05-07T20:27:09.6907819Z #define __UINT_LEAST16_MAX__ 0xffff 2025-05-07T20:27:09.6908095Z #define _IO_LEFT 02 2025-05-07T20:27:09.6908322Z #define __ATOMIC_ACQUIRE 2 2025-05-07T20:27:09.6908603Z #define _POSIX2_BC_SCALE_MAX 99 2025-05-07T20:27:09.6909046Z #define _GLIBCXX_USE_RANDOM_TR1 1 2025-05-07T20:27:09.6909656Z #define _GLIBCXX_MOVE_BACKWARD3(_Tp,_Up,_Vp) std::move_backward(_Tp, _Up, _Vp) 2025-05-07T20:27:09.6910179Z #define __FLT128_MAX_10_EXP__ 4932 2025-05-07T20:27:09.6910674Z #define RE_DUP_MAX (0x7fff) 2025-05-07T20:27:09.6910949Z #define _IOS_OUTPUT 2 2025-05-07T20:27:09.6911265Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F 2025-05-07T20:27:09.6911772Z #define toascii_l(c,l) __toascii_l ((c), (l)) 2025-05-07T20:27:09.6912253Z #define __GCC_IEC_559_COMPLEX 2 2025-05-07T20:27:09.6912690Z #define _GLIBCXX_USE_FCHMOD 1 2025-05-07T20:27:09.6913129Z #define __cpp_aggregate_nsdmi 201304L 2025-05-07T20:27:09.6914386Z #define __bswap_16(x) (__extension__ ({ unsigned short int __v, __x = (unsigned short int) (x); if (__builtin_constant_p (__x)) __v = __bswap_constant_16 (__x); else __asm__ ("rorw $8, %w0" : "=r" (__v) : "0" (__x) : "cc"); __v; })) 2025-05-07T20:27:09.6927869Z #define __UINT_LEAST8_TYPE__ unsigned char 2025-05-07T20:27:09.6928339Z #define __SIZEOF_FLOAT80__ 16 2025-05-07T20:27:09.6928768Z #define cudaTextureTypeCubemapLayered 0xFC 2025-05-07T20:27:09.6929220Z #define _T_WCHAR_ 2025-05-07T20:27:09.6929559Z #define stdout stdout 2025-05-07T20:27:09.6930065Z #define _GLIBCXX_ABI_TAG_CXX11 __attribute ((__abi_tag__ ("cxx11"))) 2025-05-07T20:27:09.6930639Z #define CHAR_BIT __CHAR_BIT__ 2025-05-07T20:27:09.6931019Z #define __flexarr [] 2025-05-07T20:27:09.6931370Z #define _GLIBCXX_HAVE_FINITEF 1 2025-05-07T20:27:09.6931839Z #define __islower_l(c,l) __isctype_l((c), _ISlower, (l)) 2025-05-07T20:27:09.6932363Z #define _IO_FLAGS2_USER_WBUF 8 2025-05-07T20:27:09.6932740Z #define _MATH_H 1 2025-05-07T20:27:09.6933139Z #define cudaOccupancyDisableCachingOverride 0x01 2025-05-07T20:27:09.6933626Z #define __S64_TYPE long int 2025-05-07T20:27:09.6933984Z #define __stub_fchflags 2025-05-07T20:27:09.6934276Z #define cudaDeviceScheduleMask 0x07 2025-05-07T20:27:09.6934687Z #define __SQUAD_TYPE long int 2025-05-07T20:27:09.6934961Z #define __INTMAX_C(c) c ## L 2025-05-07T20:27:09.6935234Z #define _BSD_SIZE_T_DEFINED_ 2025-05-07T20:27:09.6935493Z #define NL_NMAX INT_MAX 2025-05-07T20:27:09.6935737Z #define _BITS_TIME_H 1 2025-05-07T20:27:09.6936032Z #define M_LN10l 2.302585092994045684017991454684364208L 2025-05-07T20:27:09.6936362Z #define 
2025-05-07T20:27:09.6940186Z #define cudaGetDeviceProperties cudaGetDeviceProperties_v2
2025-05-07T20:27:09.6948413Z #define __GLIBC_PREREQ(maj,min) ((__GLIBC__ << 16) + __GLIBC_MINOR__ >= ((maj) << 16) + (min))
2025-05-07T20:27:09.7007180Z #define __CUDA_API_VER_MAJOR__ 12
2025-05-07T20:27:09.7015014Z #define __CUDART_API_VERSION ((__CUDA_API_VER_MAJOR__ * 1000) + (__CUDA_API_VER_MINOR__ * 10))
2025-05-07T20:27:09.7045043Z #define __GLIBC__ 2
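Two of the entries above are worth decoding. __CUDART_API_VERSION packs the CUDA runtime version as major * 1000 + minor * 10, so 12.6 encodes to 12060, and __GLIBC_PREREQ packs major/minor into (major << 16) + minor before comparing. A minimal sketch of the same arithmetic (version_checks.cpp is a hypothetical name; a glibc system is assumed):

    // version_checks.cpp (hypothetical) -- sketch of the version encodings used by
    // the __CUDART_API_VERSION and __GLIBC_PREREQ macros shown in the dump.
    #include <cstdio>
    #include <features.h>   // __GLIBC__, __GLIBC_MINOR__, __GLIBC_PREREQ (glibc only)

    int main() {
        // CUDA 12.6 encodes as 12 * 1000 + 6 * 10 = 12060.
        const int cudart_api = 12 * 1000 + 6 * 10;
        std::printf("CUDART API version: %d\n", cudart_api);
    #if __GLIBC_PREREQ(2, 17)
        // True when (__GLIBC__ << 16) + __GLIBC_MINOR__ >= (2 << 16) + 17.
        std::printf("glibc %d.%d >= 2.17\n", __GLIBC__, __GLIBC_MINOR__);
    #endif
        return 0;
    }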
2025-05-07T20:27:09.7133754Z #define __CUDACC_VER_MINOR__ 6
2025-05-07T20:27:09.7150047Z #define htobe32(x) __bswap_32 (x)
2025-05-07T20:27:09.7153340Z #define __GNUC__ 11
2025-05-07T20:27:09.7186983Z #define be32toh(x) __bswap_32 (x)
2025-05-07T20:27:09.7193676Z #define __cplusplus 201703L
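htobe32 and be32toh above both expand to __bswap_32 because the runner is little-endian x86_64; converting to big-endian and converting back are the same byte swap. A self-contained sketch (endian_roundtrip.cpp, a hypothetical name) of round-tripping a word through the <endian.h> macros:

    // endian_roundtrip.cpp (hypothetical) -- sketch using the <endian.h> macros
    // from the dump; on this little-endian box both calls are a byte swap.
    #include <cstdio>
    #include <cstdint>
    #include <endian.h>   // htobe32, be32toh (glibc)

    int main() {
        const uint32_t host = 0x11223344u;
        const uint32_t wire = htobe32(host);   // reads back as 0x44332211 on x86_64
        std::printf("host 0x%08x -> wire 0x%08x -> back 0x%08x\n",
                    host, wire, be32toh(wire));
        return 0;
    }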
2025-05-07T20:27:09.7213590Z #define __GLIBCXX__ 20230528
2025-05-07T20:27:09.7237813Z #define __GXX_ABI_VERSION 1016
2025-05-07T20:27:09.7245933Z #define __x86_64 1
2025-05-07T20:27:09.7260153Z #define __LP64__ 1
2025-05-07T20:27:09.7272583Z #define __CUDACC_VER_BUILD__ 85
2025-05-07T20:27:09.7291748Z #define __VERSION__ "11.4.0"
2025-05-07T20:27:09.7301132Z #define _GLIBCXX_USE_CXX11_ABI 1
2025-05-07T20:27:09.7318642Z #define __SM_80_RT_HPP__ 2025-05-07T20:27:09.7318900Z #define __need_clockid_t 2025-05-07T20:27:09.7319223Z #define SSIZE_MAX LONG_MAX 2025-05-07T20:27:09.7319680Z #define _GLIBCXX_HAVE_USELOCALE 1 2025-05-07T20:27:09.7320009Z #define __glibcxx_requires_string_len(_String,_Len) 2025-05-07T20:27:09.7320345Z #define _IO_HEX 0100 2025-05-07T20:27:09.7320603Z #define __NFDBITS (8 * (int) sizeof (__fd_mask)) 2025-05-07T20:27:09.7320954Z #define cudaExternalMemoryDedicated 0x1 2025-05-07T20:27:09.7321272Z #define _GLIBCXX_HAVE_TGMATH_H 1 2025-05-07T20:27:09.7321551Z #define _GLIBCXX11_USE_C99_COMPLEX 1 2025-05-07T20:27:09.7321964Z #define _GLIBCXX17_DEPRECATED_SUGGEST(ALT) _GLIBCXX_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:27:09.7322440Z #define ispunct_l(c,l) __ispunct_l ((c), (l)) 2025-05-07T20:27:09.7322801Z #define __cpp_aggregate_bases 201603L 2025-05-07T20:27:09.7323096Z #define __cudaGet_blockDim() blockDim 2025-05-07T20:27:09.7323215Z #define __cudaCDP2Memcpy3DAsync 2025-05-07T20:27:09.7323322Z #define __cudaCDP2MemcpyAsync 2025-05-07T20:27:09.7323417Z #define __stub_sstk 2025-05-07T20:27:09.7323516Z #define _IO_IN_BACKUP 0x100 2025-05-07T20:27:09.7323672Z #define _GLIBCXX_USE_C99_STDLIB _GLIBCXX11_USE_C99_STDLIB 2025-05-07T20:27:09.7323758Z #define __wur 2025-05-07T20:27:09.7323886Z #define isprint_l(c,l) __isprint_l ((c), (l)) 2025-05-07T20:27:09.7323973Z #define _G_HAVE_MMAP 1 2025-05-07T20:27:09.7324064Z #define _IO_OCT 040 2025-05-07T20:27:09.7324305Z #define __FLT128_HAS_DENORM__ 1 2025-05-07T20:27:09.7324401Z #define NL_MSGMAX INT_MAX 2025-05-07T20:27:09.7324503Z #define _GLIBCXX_USE_LFS 1 2025-05-07T20:27:09.7324632Z #define cudaDeviceScheduleBlockingSync 0x04 2025-05-07T20:27:09.7324725Z #define _POSIX_RTSIG_MAX 8 2025-05-07T20:27:09.7324837Z #define _GLIBCXX_NOEXCEPT noexcept 2025-05-07T20:27:09.7325028Z #define __glibcxx_requires_partitioned_lower(_First,_Last,_Value) 2025-05-07T20:27:09.7325124Z #define __FLT32_DECIMAL_DIG__ 9 2025-05-07T20:27:09.7325222Z #define _STL_ALGOBASE_H 1 2025-05-07T20:27:09.7325674Z #define __cudaCDP2MemsetAsync_ptsz 2025-05-07T20:27:09.7325806Z #define __off64_t_defined 2025-05-07T20:27:09.7325916Z #define _GLIBCXX_WEAK_DEFINITION 2025-05-07T20:27:09.7326006Z #define __FLT128_DIG__ 33 2025-05-07T20:27:09.7326119Z #define _GLIBCXX_USE_C99_INTTYPES_TR1 1 2025-05-07T20:27:09.7326221Z #define _GLIBCXX_HAVE_LOCALE_H 1 2025-05-07T20:27:09.7326307Z #define __INT32_C(c) c 2025-05-07T20:27:09.7326418Z #define __DEC64_EPSILON__ 1E-15DD 2025-05-07T20:27:09.7326521Z #define __ORDER_PDP_ENDIAN__ 3412 2025-05-07T20:27:09.7326619Z #define __DEC128_MIN_EXP__ (-6142) 2025-05-07T20:27:09.7326719Z #define __PDP_ENDIAN 3412 2025-05-07T20:27:09.7326809Z #define _ISOC95_SOURCE 1 2025-05-07T20:27:09.7326908Z #define _IO_fpos64_t _G_fpos64_t 2025-05-07T20:27:09.7327047Z #define M_PI_2l 1.570796326794896619231321691639751442L 2025-05-07T20:27:09.7327146Z #define BYTE_ORDER __BYTE_ORDER 2025-05-07T20:27:09.7327236Z #define __SM_90_RT_HPP__ 2025-05-07T20:27:09.7327341Z #define __INT_FAST32_TYPE__ long int 2025-05-07T20:27:09.7327444Z #define __have_pthread_attr_t 1 2025-05-07T20:27:09.7327554Z #define _GLIBCXX_HAVE_LIMIT_DATA 1 2025-05-07T20:27:09.7327778Z #define _GLIBCXX_BEGIN_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_BEGIN_NAMESPACE_CXX11 2025-05-07T20:27:09.7327889Z #define __cudaCDP2StreamWaitEvent 2025-05-07T20:27:09.7327997Z #define __cudaCDP2EventRecord 2025-05-07T20:27:09.7328091Z #define _BITS_TYPESIZES_H 1 2025-05-07T20:27:09.7328183Z #define 
htole32(x) (x) 2025-05-07T20:27:09.7328447Z #define __cudaCDP2OccupancyMaxActiveBlocksPerMultiprocessorWithFlags 2025-05-07T20:27:09.7328570Z #define __SYSCALL_SLONG_TYPE __SLONGWORD_TYPE 2025-05-07T20:27:09.7328671Z #define _GLIBCXX_USE_C99_MATH_TR1 1 2025-05-07T20:27:09.7328838Z #define WSTOPSIG(status) __WSTOPSIG (__WAIT_INT (status)) 2025-05-07T20:27:09.7328978Z #define _GLIBCXX_USE_C99_MATH _GLIBCXX11_USE_C99_MATH 2025-05-07T20:27:09.7329112Z #define __UINT_LEAST16_TYPE__ short unsigned int 2025-05-07T20:27:09.7329253Z #define __WIFEXITED(status) (__WTERMSIG(status) == 0) 2025-05-07T20:27:09.7329352Z #define ADJ_OFFSET 0x0001 2025-05-07T20:27:09.7329460Z #define cudaArrayLayered 0x01 2025-05-07T20:27:09.7329627Z #define _PSTL_ICC_18_OMP_SIMD_BROKEN (__INTEL_COMPILER == 1800) 2025-05-07T20:27:09.7329741Z #define cudaEventRecordDefault 0x00 2025-05-07T20:27:09.7329844Z #define _GLIBCXX_HAVE_FMODF 1 2025-05-07T20:27:09.7329944Z #define _PSTL_PRAGMA_MESSAGE(x) 2025-05-07T20:27:09.7330033Z #define unix 1 2025-05-07T20:27:09.7330134Z #define __DBL_HAS_DENORM__ 1 2025-05-07T20:27:09.7330227Z #define _POSIX_CHILD_MAX 25 2025-05-07T20:27:09.7330320Z #define _POSIX_MAX_INPUT 255 2025-05-07T20:27:09.7330447Z #define __cudaCDP2DeviceGetCacheConfig 2025-05-07T20:27:09.7330534Z #define __USE_POSIX 1 2025-05-07T20:27:09.7330636Z #define __FD_ZERO_STOS "stosq" 2025-05-07T20:27:09.7330768Z #define _PSTL_VERSION_MAJOR (_PSTL_VERSION / 1000) 2025-05-07T20:27:09.7330862Z #define __THROWNL throw () 2025-05-07T20:27:09.7330963Z #define __cpp_rtti 199711L 2025-05-07T20:27:09.7331073Z #define __SIZE_TYPE__ long unsigned int 2025-05-07T20:27:09.7331162Z #define __PMT(args) args 2025-05-07T20:27:09.7331285Z #define __UINT64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:27:09.7331436Z #define __va_arg_pack_len() __builtin_va_arg_pack_len () 2025-05-07T20:27:09.7331551Z #define __ULONGWORD_TYPE unsigned long int 2025-05-07T20:27:09.7331648Z #define _SIZE_T_DECLARED 2025-05-07T20:27:09.7331984Z #define _PSTL_STRING_AUX(x) #x 2025-05-07T20:27:09.7332088Z #define __FLT_IS_IEC_60559__ 2 2025-05-07T20:27:09.7332485Z #define _PSTL_CPP14_MAKE_REVERSE_ITERATOR_PRESENT (_MSC_VER >= 1900 || __cplusplus >= 201402L || __cpp_lib_make_reverse_iterator == 201402) 2025-05-07T20:27:09.7332587Z #define _GLIBCXX_HAVE_LIMIT_AS 1 2025-05-07T20:27:09.7332689Z #define XATTR_LIST_MAX 65536 2025-05-07T20:27:09.7332786Z #define __CUDACC_VER_MAJOR__ 12 2025-05-07T20:27:09.7332929Z #define __GNUC_WIDE_EXECUTION_CHARSET_NAME "UTF-32LE" 2025-05-07T20:27:09.7333018Z #define _WCHAR_T_H 2025-05-07T20:27:09.7333110Z #define __FLT64X_DIG__ 18 2025-05-07T20:27:09.7333328Z #define _IO_SHOWBASE 0200 2025-05-07T20:27:09.7333421Z #define _POSIX_QLIMIT 1 2025-05-07T20:27:09.7333521Z #define __INT8_TYPE__ signed char 2025-05-07T20:27:09.7333618Z #define __SURFACE_TYPES_H__ 2025-05-07T20:27:09.7333712Z #define __CUDA_ARCH__ 520 2025-05-07T20:27:09.7333823Z #define __cpp_digit_separators 201309L 2025-05-07T20:27:09.7333913Z #define __ELF__ 1 2025-05-07T20:27:09.7334019Z #define CLOCK_THREAD_CPUTIME_ID 3 2025-05-07T20:27:09.7334123Z #define __GCC_ASM_FLAG_OUTPUTS__ 1 2025-05-07T20:27:09.7334215Z #define STA_INS 0x0010 2025-05-07T20:27:09.7334316Z #define __UINT32_TYPE__ unsigned int 2025-05-07T20:27:09.7334487Z #define _toupper(c) ((int) (*__ctype_toupper_loc ())[(int) (c)]) 2025-05-07T20:27:09.7334692Z #define _BITS_BYTESWAP_H 1 2025-05-07T20:27:09.7334797Z #define __ID_T_TYPE __U32_TYPE 2025-05-07T20:27:09.7334908Z #define __TIME_T_TYPE __SYSCALL_SLONG_TYPE 
2025-05-07T20:27:09.7335024Z #define __DEVICE_DOUBLE_FUNCTIONS_HPP__ 2025-05-07T20:27:09.7335127Z #define _GLIBCXX_HAVE_MBSTATE_T 1 2025-05-07T20:27:09.7335237Z #define __cpp_lib_logical_traits 201510 2025-05-07T20:27:09.7335335Z #define ADJ_OFFSET_SS_READ 0xa001 2025-05-07T20:27:09.7335490Z #define __warnattr(msg) __attribute__((__warning__ (msg))) 2025-05-07T20:27:09.7335657Z #define _PSTL_PRAGMA_LOCATION " [Parallel STL message]: " 2025-05-07T20:27:09.7335757Z #define _IO_funlockfile(_fp) 2025-05-07T20:27:09.7336092Z #define cudaKernelNodeAttributeAccessPolicyWindow cudaLaunchAttributeAccessPolicyWindow 2025-05-07T20:27:09.7336231Z #define M_2_PIl 0.636619772367581343075535053490057448L 2025-05-07T20:27:09.7336323Z #define __DRIVER_TYPES_H__ 2025-05-07T20:27:09.7336410Z #define __FLT_RADIX__ 2 2025-05-07T20:27:09.7336521Z #define __INT_LEAST16_TYPE__ short int 2025-05-07T20:27:09.7336690Z #define __LDBL_EPSILON__ 1.08420217248550443400745280086994171e-19L 2025-05-07T20:27:09.7336795Z #define __UINTMAX_C(c) c ## UL 2025-05-07T20:27:09.7336889Z #define _GLIBCXX_USE_LSTAT 1 2025-05-07T20:27:09.7337000Z #define minor(dev) gnu_dev_minor (dev) 2025-05-07T20:27:09.7337105Z #define _POSIX_C_SOURCE 200809L 2025-05-07T20:27:09.7337203Z #define _GLIBCXX_HAVE_DIRENT_H 1 2025-05-07T20:27:09.7337305Z #define __GLIBCXX_BITSIZE_INT_N_0 128 2025-05-07T20:27:09.7337401Z #define WORD_BIT 32 2025-05-07T20:27:09.7337489Z #define _IO_USER_BUF 1 2025-05-07T20:27:09.7337586Z #define __VECTOR_TYPES_H__ 2025-05-07T20:27:09.7337706Z #define __SM_20_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:27:09.7337814Z #define cudaHostAllocPortable 0x01 2025-05-07T20:27:09.7337913Z #define PTHREAD_STACK_MIN 16384 2025-05-07T20:27:09.7338028Z #define __long_double_t long double 2025-05-07T20:27:09.7338124Z #define _GLIBCXX_HAVE_ISINF 1 2025-05-07T20:27:09.7338223Z #define _POSIX_ARG_MAX 4096 2025-05-07T20:27:09.7338625Z #define cudaKernelNodeAttributeDeviceUpdatableKernelNode cudaLaunchAttributeDeviceUpdatableKernelNode 2025-05-07T20:27:09.7338709Z #define __k8 1 2025-05-07T20:27:09.7338913Z #define _GLIBCXX_NO_OBSOLETE_ISINF_ISNAN_DYNAMIC __GLIBC_PREREQ(2,23) 2025-05-07T20:27:09.7339094Z #define __FLT32X_MIN__ 2.22507385850720138309023271733240406e-308F32x 2025-05-07T20:27:09.7339211Z #define __LDBL_REDIR(name,proto) name proto 2025-05-07T20:27:09.7339322Z #define __SIG_ATOMIC_MAX__ 0x7fffffff 2025-05-07T20:27:09.7339421Z #define __SM_30_INTRINSICS_HPP__ 2025-05-07T20:27:09.7339522Z #define _GLIBCXX_EXTERN_TEMPLATE 1 2025-05-07T20:27:09.7339828Z #define __blksize_t_defined 2025-05-07T20:27:09.7339932Z #define _IO_SHOWPOINT 0400 2025-05-07T20:27:09.7340034Z #define _GLIBCXX_HAVE_LIMIT_RSS 1 2025-05-07T20:27:09.7340148Z #define cudaDeviceLmemResizeToMax 0x10 2025-05-07T20:27:09.7340243Z #define _GLIBCXX_X86_RDRAND 1 2025-05-07T20:27:09.7340358Z #define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 2 2025-05-07T20:27:09.7340454Z #define _IO_IS_FILEBUF 0x2000 2025-05-07T20:27:09.7340550Z #define _GLIBCXX_USE_DUAL_ABI 1 2025-05-07T20:27:09.7340819Z #define __bswap_constant_16(x) ((unsigned short int) ((((x) >> 8) & 0xff) | (((x) & 0xff) << 8))) 2025-05-07T20:27:09.7341248Z #define cudaSignalExternalSemaphoresAsync __CUDART_API_PTSZ(cudaSignalExternalSemaphoresAsync_v2) 2025-05-07T20:27:09.7341354Z #define UCHAR_MAX (SCHAR_MAX * 2 + 1) 2025-05-07T20:27:09.7341459Z #define __SIZEOF_PTRDIFF_T__ 8 2025-05-07T20:27:09.7341547Z #define SEEK_SET 0 2025-05-07T20:27:09.7341653Z #define _GLIBCXX_TR1_GAMMA_TCC 1 2025-05-07T20:27:09.7341754Z #define 
__CUDA_API_VER_MINOR__ 6 2025-05-07T20:27:09.7341957Z #define _GLIBCXX_VISIBILITY(V) __attribute__ ((__visibility__ (#V))) 2025-05-07T20:27:09.7342070Z #define _GLIBCXX20_DEPRECATED(MSG) 2025-05-07T20:27:09.7342175Z #define __cudaCDP2GetLastError 2025-05-07T20:27:09.7342271Z #define _GLIBCXX_HAVE_COSL 1 2025-05-07T20:27:09.7342374Z #define _MATH_H_MATHDEF 1 2025-05-07T20:27:09.7342700Z #define __bswap_constant_32(x) ((((x) & 0xff000000) >> 24) | (((x) & 0x00ff0000) >> 8) | (((x) & 0x0000ff00) << 8) | (((x) & 0x000000ff) << 24)) 2025-05-07T20:27:09.7342805Z #define _GLIBCXX_USE_FLOAT128 1 2025-05-07T20:27:09.7342910Z #define _IO_FLAGS2_NOTCANCEL 2 2025-05-07T20:27:09.7343009Z #define __stub_sigreturn 2025-05-07T20:27:09.7343266Z #define __errordecl(name,msg) extern void name (void) __attribute__((__error__ (msg))) 2025-05-07T20:27:09.7343364Z #define _GLIBCXX_HAVE_UTIME_H 1 2025-05-07T20:27:09.7343459Z #define __HOST_CONFIG_H__ 2025-05-07T20:27:09.7343573Z #define _XOPEN_SOURCE_EXTENDED 1 2025-05-07T20:27:09.7343662Z #define CLOCK_TAI 11 2025-05-07T20:27:09.7343779Z #define _GLIBCXX_END_NAMESPACE_VERSION 2025-05-07T20:27:09.7343879Z #define __restrict_arr 2025-05-07T20:27:09.7343999Z #define _PSTL_PRAGMA_MESSAGE_POLICIES(x) 2025-05-07T20:27:09.7344145Z #define __glibcxx_requires_valid_range(_First,_Last) 2025-05-07T20:27:09.7344698Z #define strndupa(s,n) (__extension__ ({ const char *__old = (s); size_t __len = strnlen (__old, (n)); char *__new = (char *) __builtin_alloca (__len + 1); __new[__len] = '\0'; (char *) memcpy (__new, __old, __len); })) 2025-05-07T20:27:09.7344891Z #define __attribute_artificial__ __attribute__ ((__artificial__)) 2025-05-07T20:27:09.7345023Z #define __USE_MISC 1 2025-05-07T20:27:09.7345163Z #define __UWORD_TYPE unsigned long int 2025-05-07T20:27:09.7345296Z #define _EXCEPTION_DEFINES_H 1 2025-05-07T20:27:09.7345395Z #define _GCC_LIMITS_H_ 2025-05-07T20:27:09.7345624Z #define __LDBL_DIG__ 18 2025-05-07T20:27:09.7345722Z #define __BIT_TYPES_DEFINED__ 1 2025-05-07T20:27:09.7345829Z #define __malloc_and_calloc_defined 2025-05-07T20:27:09.7345929Z #define __FLT64_IS_IEC_60559__ 2 2025-05-07T20:27:09.7346029Z #define _GLIBCXX_HAVE_SYS_SYSINFO_H 1 2025-05-07T20:27:09.7346115Z #define __x86_64__ 1 2025-05-07T20:27:09.7346196Z #define _SIZE_T_ 2025-05-07T20:27:09.7347091Z #define __bswap_constant_64(x) (__extension__ ((((x) & 0xff00000000000000ull) >> 56) | (((x) & 0x00ff000000000000ull) >> 40) | (((x) & 0x0000ff0000000000ull) >> 24) | (((x) & 0x000000ff00000000ull) >> 8) | (((x) & 0x00000000ff000000ull) << 8) | (((x) & 0x0000000000ff0000ull) << 24) | (((x) & 0x000000000000ff00ull) << 40) | (((x) & 0x00000000000000ffull) << 56))) 2025-05-07T20:27:09.7347199Z #define _POSIX2_COLL_WEIGHTS_MAX 2 2025-05-07T20:27:09.7347294Z #define __FLT32X_MIN_EXP__ (-1021) 2025-05-07T20:27:09.7347415Z #define __PTHREAD_RWLOCK_INT_FLAGS_SHARED 1 2025-05-07T20:27:09.7347533Z #define __DEC32_SUBNORMAL_MIN__ 0.000001E-95DF 2025-05-07T20:27:09.7347631Z #define _IO_iconv_t _G_iconv_t 2025-05-07T20:27:09.7347745Z #define _GLIBCXX_FLOAT_IS_IEEE_BINARY32 1 2025-05-07T20:27:09.7347966Z #define __cpp_lib_make_reverse_iterator 201402 2025-05-07T20:27:09.7348121Z #define _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(A) 2025-05-07T20:27:09.7348255Z #define _GLIBCXX_HAVE_DLFCN_H 1 2025-05-07T20:27:09.7348889Z #define strdupa(s) (__extension__ ({ const char *__old = (s); size_t __len = strlen (__old) + 1; char *__new = (char *) __builtin_alloca (__len); (char *) memcpy (__new, __old, __len); })) 
2025-05-07T20:27:09.7349024Z #define __no_return__ __attribute__((noreturn)) 2025-05-07T20:27:09.7349170Z #define __device_builtin__ __location__(device_builtin) 2025-05-07T20:27:09.7349371Z #define _PSTL_HIDE_FROM_ABI_POP 2025-05-07T20:27:09.7349476Z #define _GLIBCXX_HAVE_ACOSF 1 2025-05-07T20:27:09.7349566Z #define STA_FLL 0x0008 2025-05-07T20:27:09.7349718Z #define _GLIBCXX_HAVE_BUILTIN_IS_CONSTANT_EVALUATED 1 2025-05-07T20:27:09.7349819Z #define _GLIBCXX_END_EXTERN_C } 2025-05-07T20:27:09.7349948Z #define __INT_FAST16_MAX__ 0x7fffffffffffffffL 2025-05-07T20:27:09.7350074Z #define __cpp_lib_integer_sequence 201304 2025-05-07T20:27:09.7350165Z #define __stub_revoke 2025-05-07T20:27:09.7350260Z #define __timer_t_defined 1 2025-05-07T20:27:09.7350412Z #define _GLIBCXX11_DEPRECATED _GLIBCXX_DEPRECATED 2025-05-07T20:27:09.7350509Z #define INT_MAX __INT_MAX__ 2025-05-07T20:27:09.7350620Z #define ULLONG_MAX (LLONG_MAX * 2ULL + 1) 2025-05-07T20:27:09.7350743Z #define _GLIBCXX_END_NAMESPACE_CXX11 } 2025-05-07T20:27:09.7350845Z #define _GLIBCXX_ICONV_CONST 2025-05-07T20:27:09.7350951Z #define major(dev) gnu_dev_major (dev) 2025-05-07T20:27:09.7351074Z #define cudaArrayTextureGather 0x08 2025-05-07T20:27:09.7351183Z #define _GLIBCXX_LT_OBJDIR ".libs/" 2025-05-07T20:27:09.7351344Z #define __inline_hint__ __attribute__((nv_inline_hint)) 2025-05-07T20:27:09.7351446Z #define __NV_LEGACY_LAUNCH 1 2025-05-07T20:27:09.7351540Z #define _IO_off_t __off_t 2025-05-07T20:27:09.7351637Z #define __FLT64_DIG__ 15 2025-05-07T20:27:09.7351869Z #define PTHREAD_DESTRUCTOR_ITERATIONS _POSIX_THREAD_DESTRUCTOR_ITERATIONS 2025-05-07T20:27:09.7351970Z #define _POSIX2_LINE_MAX 2048 2025-05-07T20:27:09.7352111Z #define __UINT_FAST32_MAX__ 0xffffffffffffffffUL 2025-05-07T20:27:09.7352237Z #define __UINT_LEAST64_TYPE__ long unsigned int 2025-05-07T20:27:09.7352337Z #define ADJ_FREQUENCY 0x0002 2025-05-07T20:27:09.7352453Z #define __CUDART_API_PTDS(api) api 2025-05-07T20:27:09.7352542Z #define NULL __null 2025-05-07T20:27:09.7352684Z #define cudaStreamPerThread ((cudaStream_t)0x2) 2025-05-07T20:27:09.7352793Z #define _GLIBCXX_CONSTEXPR constexpr 2025-05-07T20:27:09.7352898Z #define __U64_TYPE unsigned long int 2025-05-07T20:27:09.7353010Z #define __FLT_HAS_QUIET_NAN__ 1 2025-05-07T20:27:09.7353109Z #define __FLT_MAX_10_EXP__ 38 2025-05-07T20:27:09.7353196Z #define FP_ZERO 2 2025-05-07T20:27:09.7353302Z #define _GLIBCXX_HAVE_FLOORL 1 2025-05-07T20:27:09.7353458Z #define __isgraph_l(c,l) __isctype_l((c), _ISgraph, (l)) 2025-05-07T20:27:09.7353571Z #define __LONG_MAX__ 0x7fffffffffffffffL 2025-05-07T20:27:09.7353671Z #define __WCHAR_T__ 2025-05-07T20:27:09.7353772Z #define __FLT64X_HAS_DENORM__ 1 2025-05-07T20:27:09.7353968Z #define __DEC128_SUBNORMAL_MIN__ 0.000000000000000000000000000000001E-6143DL 2025-05-07T20:27:09.7354127Z #define _GLIBCXX_NORETURN __attribute__ ((__noreturn__)) 2025-05-07T20:27:09.7354224Z #define __FLT_HAS_INFINITY__ 1 2025-05-07T20:27:09.7354350Z #define __GNUC_EXECUTION_CHARSET_NAME "UTF-8" 2025-05-07T20:27:09.7354465Z #define _GLIBCXX20_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:27:09.7354593Z #define __WSTOPSIG(status) __WEXITSTATUS(status) 2025-05-07T20:27:09.7354724Z #define cudaSurfaceTypeCubemapLayered 0xFC 2025-05-07T20:27:09.7354821Z #define _BSD_PTRDIFF_T_ 2025-05-07T20:27:09.7354910Z #define _SIGSET_H_types 1 2025-05-07T20:27:09.7355029Z #define cudaTextureType1DLayered 0xF1 2025-05-07T20:27:09.7355133Z #define __cpp_unicode_literals 200710L 2025-05-07T20:27:09.7355280Z #define __isdigit_l(c,l) 
__isctype_l((c), _ISdigit, (l)) 2025-05-07T20:27:09.7355390Z #define __LONG_LONG_PAIR(HI,LO) LO, HI 2025-05-07T20:27:09.7355632Z #define __UINT_FAST16_TYPE__ long unsigned int 2025-05-07T20:27:09.7355784Z #define __bos0(ptr) __builtin_object_size (ptr, 0) 2025-05-07T20:27:09.7355991Z #define __DEC64_MAX__ 9.999999999999999E384DD 2025-05-07T20:27:09.7356191Z #define M_1_PIl 0.318309886183790671537767526745028724L 2025-05-07T20:27:09.7356401Z #define WIFSTOPPED(status) __WIFSTOPPED (__WAIT_INT (status)) 2025-05-07T20:27:09.7368166Z #define __INT_FAST32_WIDTH__ 64 2025-05-07T20:27:09.7368312Z #define _POSIX2_CHARCLASS_NAME_MAX 14 2025-05-07T20:27:09.7368411Z #define _GLIBCXX_BITS_STD_ABS_H 2025-05-07T20:27:09.7368643Z #define STA_MODE 0x4000 2025-05-07T20:27:09.7368757Z #define __CHAR16_TYPE__ short unsigned int 2025-05-07T20:27:09.7368862Z #define __PRAGMA_REDEFINE_EXTNAME 1 2025-05-07T20:27:09.7368994Z #define __glibcxx_signed_b(T,B) ((T)(-1) < 0) 2025-05-07T20:27:09.7369100Z #define __USING_NAMESPACE_C99(name) 2025-05-07T20:27:09.7369199Z #define BIG_ENDIAN __BIG_ENDIAN 2025-05-07T20:27:09.7369320Z #define __cudaCDP2EventRecord_ptsz 2025-05-07T20:27:09.7369419Z #define _GLIBCXX_HAVE_SINL 1 2025-05-07T20:27:09.7369535Z #define EXPR_NEST_MAX _POSIX2_EXPR_NEST_MAX 2025-05-07T20:27:09.7369634Z #define __SIZE_WIDTH__ 64 2025-05-07T20:27:09.7369749Z #define __BLKSIZE_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:27:09.7369838Z #define __SEG_FS 1 2025-05-07T20:27:09.7369924Z #define _IO_size_t size_t 2025-05-07T20:27:09.7370033Z #define __INT_LEAST16_MAX__ 0x7fff 2025-05-07T20:27:09.7370135Z #define INT_MIN (-INT_MAX - 1) 2025-05-07T20:27:09.7370225Z #define __stub_lchmod 2025-05-07T20:27:09.7370335Z #define __DEC64_MANT_DIG__ 16 2025-05-07T20:27:09.7370446Z #define __INT64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:27:09.7370554Z #define _GLIBCXX_MANGLE_SIZE_T m 2025-05-07T20:27:09.7370638Z #define __SEG_GS 1 2025-05-07T20:27:09.7370823Z #define __FLT32_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F32 2025-05-07T20:27:09.7370925Z #define _IOS_APPEND 8 2025-05-07T20:27:09.7371022Z #define __SIG_ATOMIC_WIDTH__ 32 2025-05-07T20:27:09.7371120Z #define _GLIBCXX_RELEASE 11 2025-05-07T20:27:09.7371228Z #define _GLIBCXX98_USE_C99_WCHAR 1 2025-05-07T20:27:09.7371328Z #define _IO_IS_APPENDING 0x1000 2025-05-07T20:27:09.7371430Z #define __INT_LEAST64_TYPE__ long int 2025-05-07T20:27:09.7371529Z #define htole16(x) (x) 2025-05-07T20:27:09.7371642Z #define __TEXTURE_INDIRECT_FUNCTIONS_H__ 2025-05-07T20:27:09.7371738Z #define _GLIBCXX_HAVE_FCNTL_H 1 2025-05-07T20:27:09.7371842Z #define __INT16_TYPE__ short int 2025-05-07T20:27:09.7371946Z #define __INT_LEAST8_TYPE__ signed char 2025-05-07T20:27:09.7372068Z #define __glibcxx_class_requires(_a,_b) 2025-05-07T20:27:09.7372184Z #define __cpp_structured_bindings 201606L 2025-05-07T20:27:09.7372309Z #define __align__(n) __attribute__((aligned(n))) 2025-05-07T20:27:09.7372410Z #define __SIZEOF_INT__ 4 2025-05-07T20:27:09.7372503Z #define __WCLONE 0x80000000 2025-05-07T20:27:09.7372598Z #define __DEC32_MAX_EXP__ 97 2025-05-07T20:27:09.7372693Z #define SEEK_HOLE 4 2025-05-07T20:27:09.7372787Z #define TIMER_ABSTIME 1 2025-05-07T20:27:09.7372883Z #define __INT_FAST8_MAX__ 0x7f 2025-05-07T20:27:09.7372982Z #define __CUDA_MATH_CRTIMP 2025-05-07T20:27:09.7373158Z #define __FLT128_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:27:09.7373275Z #define __INTPTR_MAX__ 0x7fffffffffffffffL 2025-05-07T20:27:09.7373382Z #define __DRIVER_FUNCTIONS_H__ 
2025-05-07T20:27:09.7373494Z #define __cpp_sized_deallocation 201309L 2025-05-07T20:27:09.7373602Z #define __MATH_FUNCTIONS_HPP__ 2025-05-07T20:27:09.7373726Z #define __cpp_guaranteed_copy_elision 201606L 2025-05-07T20:27:09.7373824Z #define _LINUX_LIMITS_H 2025-05-07T20:27:09.7373915Z #define linux 1 2025-05-07T20:27:09.7374008Z #define MOD_MICRO ADJ_MICRO 2025-05-07T20:27:09.7374120Z #define _GLIBCXX_DEBUG_ASSERT(_Condition) 2025-05-07T20:27:09.7374224Z #define _GLIBCXX_HAVE_VSWSCANF 1 2025-05-07T20:27:09.7374319Z #define _GLIBCXX_HAVE_ISNAN 1 2025-05-07T20:27:09.7374435Z #define _XOPEN_IOV_MAX _POSIX_UIO_MAXIOV 2025-05-07T20:27:09.7374856Z #define __cudart_builtin__ __location__(cudart_builtin) 2025-05-07T20:27:09.7374964Z #define __cpp_lib_hypot 201603 2025-05-07T20:27:09.7375067Z #define __FLT64_HAS_QUIET_NAN__ 1 2025-05-07T20:27:09.7375164Z #define _GLIBCXX_HAVE_WCTYPE_H 1 2025-05-07T20:27:09.7375254Z #define MOD_NANO ADJ_NANO 2025-05-07T20:27:09.7375347Z #define htole64(x) (x) 2025-05-07T20:27:09.7375449Z #define FP_ILOGBNAN (-2147483647 - 1) 2025-05-07T20:27:09.7375582Z #define _IO_stdout ((_IO_FILE*)(&_IO_2_1_stdout_)) 2025-05-07T20:27:09.7375678Z #define _IO_UPPERCASE 01000 2025-05-07T20:27:09.7376259Z #define cudaKernelNodeAttributeClusterSchedulingPolicyPreference cudaLaunchAttributeClusterSchedulingPolicyPreference 2025-05-07T20:27:09.7376358Z #define __USE_POSIX2 1 2025-05-07T20:27:09.7376460Z #define MOD_ESTERROR ADJ_ESTERROR 2025-05-07T20:27:09.7376551Z #define __WALL 0x40000000 2025-05-07T20:27:09.7376657Z #define _GLIBCXX_HAVE_LDEXPF 1 2025-05-07T20:27:09.7376745Z #define _XLOCALE_H 1 2025-05-07T20:27:09.7376854Z #define _GLIBCXX_USE_TMPNAM 1 2025-05-07T20:27:09.7376973Z #define __FLT32_MIN_10_EXP__ (-37) 2025-05-07T20:27:09.7377102Z #define __KEY_T_TYPE __S32_TYPE 2025-05-07T20:27:09.7377247Z #define __cudaGet_threadIdx() threadIdx 2025-05-07T20:27:09.7377371Z #define __EXCEPTIONS 1 2025-05-07T20:27:09.7377504Z #define __CUDART_API_PTSZ(api) api 2025-05-07T20:27:09.7377707Z #define __launch_bounds__(...) 
__annotate__(launch_bounds(__VA_ARGS__)) 2025-05-07T20:27:09.7377794Z #define __WORDSIZE 64 2025-05-07T20:27:09.7377888Z #define CLOCK_MONOTONIC 1 2025-05-07T20:27:09.7377985Z #define _STL_RELOPS_H 1 2025-05-07T20:27:09.7378086Z #define __PTRDIFF_WIDTH__ 64 2025-05-07T20:27:09.7378184Z #define __BEGIN_DECLS extern "C" { 2025-05-07T20:27:09.7378289Z #define _GLIBCXX_HAVE_SYS_IPC_H 1 2025-05-07T20:27:09.7378385Z #define __LDBL_MANT_DIG__ 64 2025-05-07T20:27:09.7378487Z #define _GLIBCXX_HAVE_TRUNCATE 1 2025-05-07T20:27:09.7378798Z #define cudaKernelNodeAttributeClusterDimension cudaLaunchAttributeClusterDimension 2025-05-07T20:27:09.7379039Z #define _PSTL_GCC_VERSION (__GNUC__ * 10000 + __GNUC_MINOR__ * 100 + __GNUC_PATCHLEVEL__) 2025-05-07T20:27:09.7379187Z #define _GLIBCXX_NAMESPACE_CXX11 __cxx11:: 2025-05-07T20:27:09.7379288Z #define _GLIBCXX_NUMERIC_LIMITS 1 2025-05-07T20:27:09.7379396Z #define __cpp_range_based_for 201603L 2025-05-07T20:27:09.7379516Z #define __cpp_lib_exchange_function 201304 2025-05-07T20:27:09.7379618Z #define _GLIBCXX_HAVE_INTTYPES_H 1 2025-05-07T20:27:09.7379728Z #define _GLIBCXX_DARWIN_USE_64_BIT_INODE 1 2025-05-07T20:27:09.7379920Z #define cudaCooperativeLaunchMultiDeviceNoPostSync 0x02 2025-05-07T20:27:09.7380024Z #define __FLT64_HAS_INFINITY__ 1 2025-05-07T20:27:09.7380124Z #define _GLIBCXX_CSTDLIB 1 2025-05-07T20:27:09.7380236Z #define _GLIBCXX_DEBUG_MACRO_SWITCH_H 1 2025-05-07T20:27:09.7380413Z #define __FLT64X_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:27:09.7380536Z #define __STDCPP_DEFAULT_NEW_ALIGNMENT__ 16 2025-05-07T20:27:09.7380623Z #define _STRING_H 1 2025-05-07T20:27:09.7380731Z #define _BITS_PTHREADTYPES_H 1 2025-05-07T20:27:09.7380830Z #define _GCC_MAX_ALIGN_T 2025-05-07T20:27:09.7380930Z #define __SM_32_INTRINSICS_HPP__ 2025-05-07T20:27:09.7381067Z #define __SIG_ATOMIC_MIN__ (-__SIG_ATOMIC_MAX__ - 1) 2025-05-07T20:27:09.7381173Z #define __code_model_small__ 1 2025-05-07T20:27:09.7381264Z #define _PSTL_CONFIG_H 2025-05-07T20:27:09.7381371Z #define __GCC_ATOMIC_LONG_LOCK_FREE 2 2025-05-07T20:27:09.7381495Z #define __cpp_nontype_template_args 201411L 2025-05-07T20:27:09.7381591Z #define __SM_20_INTRINSICS_H__ 2025-05-07T20:27:09.7381709Z #define cudaCpuDeviceId ((int)-1) 2025-05-07T20:27:09.7382055Z #define assert(expr) ((expr) ? 
__ASSERT_VOID_CAST (0) : __assert_fail (__STRING(expr), __FILE__, __LINE__, __ASSERT_FUNCTION)) 2025-05-07T20:27:09.7382151Z #define __DEC32_MANT_DIG__ 7 2025-05-07T20:27:09.7382248Z #define le64toh(x) (x) 2025-05-07T20:27:09.7382338Z #define FILENAME_MAX 4096 2025-05-07T20:27:09.7382583Z #define __iscntrl_l(c,l) __isctype_l((c), _IScntrl, (l)) 2025-05-07T20:27:09.7382708Z #define __cpp_return_type_deduction 201304L 2025-05-07T20:27:09.7382795Z #define L_cuserid 9 2025-05-07T20:27:09.7382884Z #define __ino_t_defined 2025-05-07T20:27:09.7382977Z #define __k8__ 1 2025-05-07T20:27:09.7383079Z #define __INTPTR_TYPE__ long int 2025-05-07T20:27:09.7383190Z #define __UINT16_TYPE__ short unsigned int 2025-05-07T20:27:09.7383284Z #define __int8_t_defined 2025-05-07T20:27:09.7383377Z #define __WCHAR_TYPE__ int 2025-05-07T20:27:09.7383484Z #define __CLOCKID_T_TYPE __S32_TYPE 2025-05-07T20:27:09.7383602Z #define cudaHostRegisterPortable 0x01 2025-05-07T20:27:09.7383778Z #define __SLONGWORD_TYPE long int 2025-05-07T20:27:09.7383873Z #define _IOS_TRUNC 16 2025-05-07T20:27:09.7383994Z #define _GLIBCXX_PACKAGE_TARNAME "libstdc++" 2025-05-07T20:27:09.7384147Z #define __isblank_l(c,l) __isctype_l((c), _ISblank, (l)) 2025-05-07T20:27:09.7384243Z #define __HAVE_COLUMN 2025-05-07T20:27:09.7384331Z #define __stub_fdetach 2025-05-07T20:27:09.7384751Z #define __CUDACC_VER__ "__CUDACC_VER__ is no longer supported. Use __CUDACC_VER_MAJOR__, __CUDACC_VER_MINOR__, and __CUDACC_VER_BUILD__ instead." 2025-05-07T20:27:09.7384845Z #define __pic__ 2 2025-05-07T20:27:09.7384968Z #define __UINTPTR_MAX__ 0xffffffffffffffffUL 2025-05-07T20:27:09.7385076Z #define CLOCKS_PER_SEC 1000000l 2025-05-07T20:27:09.7385172Z #define __INT_FAST64_WIDTH__ 64 2025-05-07T20:27:09.7385276Z #define _GLIBCXX_HAVE_SOCKATMARK 1 2025-05-07T20:27:09.7385370Z #define __stub_chflags 2025-05-07T20:27:09.7385460Z #define CLOCK_BOOTTIME 7 2025-05-07T20:27:09.7385547Z #define __need_IOV_MAX 2025-05-07T20:27:09.7385669Z #define putc(_ch,_fp) _IO_putc (_ch, _fp) 2025-05-07T20:27:09.7385778Z #define __UQUAD_TYPE unsigned long int 2025-05-07T20:27:09.7385880Z #define __cpp_decltype 200707L 2025-05-07T20:27:09.7385990Z #define __BYTE_ORDER __LITTLE_ENDIAN 2025-05-07T20:27:09.7386083Z #define _GLIBCXX_USE_C99 1 2025-05-07T20:27:09.7386193Z #define _GLIBCXX_TR1_BETA_FUNCTION_TCC 1 2025-05-07T20:27:09.7386294Z #define TTY_NAME_MAX 32 2025-05-07T20:27:09.7386463Z #define _GLIBCXX_FORWARD(_Tp,__val) std::forward<_Tp>(__val) 2025-05-07T20:27:09.7386597Z #define __INT_FAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:27:09.7386770Z #define _PSTL_ASSERT(_Condition) __glibcxx_assert(_Condition) 2025-05-07T20:27:09.7386884Z #define __GCC_ATOMIC_TEST_AND_SET_TRUEVAL 1 2025-05-07T20:27:09.7386990Z #define __LITTLE_ENDIAN 1234 2025-05-07T20:27:09.7387086Z #define STA_PPSTIME 0x0004 2025-05-07T20:27:09.7387176Z #define __import__ 2025-05-07T20:27:09.7387273Z #define BUFSIZ _IO_BUFSIZ 2025-05-07T20:27:09.7387415Z #define M_SQRT2l 1.414213562373095048801688724209698079L 2025-05-07T20:27:09.7387501Z #define __export__ 2025-05-07T20:27:09.7387629Z #define __FSID_T_TYPE struct { int __val[2]; } 2025-05-07T20:27:09.7387733Z #define cudaMemAttachHost 0x02 2025-05-07T20:27:09.7387905Z #define __FLT_NORM_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:27:09.7388004Z #define _GLIBCXX_HAVE_ICONV 1 2025-05-07T20:27:09.7388099Z #define _GLIBCXX_SYMVER 1 2025-05-07T20:27:09.7388204Z #define __FLT64X_MAX_EXP__ 16384 2025-05-07T20:27:09.7388295Z #define _WCHAR_T_DECLARED 
2025-05-07T20:27:09.7388416Z #define __UINT_FAST64_TYPE__ long unsigned int 2025-05-07T20:27:09.7388542Z #define isalpha_l(c,l) __isalpha_l ((c), (l)) 2025-05-07T20:27:09.7388649Z #define __cpp_inline_variables 201606L 2025-05-07T20:27:09.7388740Z #define WNOWAIT 0x01000000 2025-05-07T20:27:09.7388831Z #define PLOSS 6 2025-05-07T20:27:09.7388926Z #define M_LN10 2.30258509299404568402 2025-05-07T20:27:09.7389191Z #define _PSTL_UDS_PRESENT (__INTEL_COMPILER >= 1900 && __INTEL_COMPILER_BUILD_DATE >= 20180626) 2025-05-07T20:27:09.7389299Z #define EXIT_SUCCESS 0 2025-05-07T20:27:09.7389400Z #define __LDBL_REDIR_DECL(name) 2025-05-07T20:27:09.7389503Z #define _GLIBCXX_HAVE_STRTOF 1 2025-05-07T20:27:09.7389605Z #define MOD_FREQUENCY ADJ_FREQUENCY 2025-05-07T20:27:09.7389698Z #define __thread__ __thread 2025-05-07T20:27:09.7389802Z #define _GLIBCXX_HAVE_MEMORY_H 1 2025-05-07T20:27:09.7389987Z #define __INT_MAX__ 0x7fffffff 2025-05-07T20:27:09.7390093Z #define __SIZEOF_PTHREAD_BARRIER_T 32 2025-05-07T20:27:09.7390329Z #define __glibcxx_requires_partitioned_upper_pred(_First,_Last,_Value,_Pred) 2025-05-07T20:27:09.7390446Z #define __cudaCDP2StreamWaitEvent_ptsz 2025-05-07T20:27:09.7390543Z #define _GLIBCXX_HAVE_SINF 1 2025-05-07T20:27:09.7390635Z #define __linux__ 1 2025-05-07T20:27:09.7390733Z #define STA_PPSSIGNAL 0x0100 2025-05-07T20:27:09.7390869Z #define M_LN2l 0.693147180559945309417232121458176568L 2025-05-07T20:27:09.7390964Z #define __S16_TYPE short int 2025-05-07T20:27:09.7391427Z #define __glibcxx_constexpr_assert(cond) if (__builtin_is_constant_evaluated() && !bool(cond)) __builtin_unreachable() 2025-05-07T20:27:09.7391540Z #define __NVCC_DIAG_PRAGMA_SUPPORT__ 1 2025-05-07T20:27:09.7391734Z #define __bos(ptr) __builtin_object_size (ptr, __USE_FORTIFY_LEVEL > 1) 2025-05-07T20:27:09.7391834Z #define __COMMON_FUNCTIONS_H__ 2025-05-07T20:27:09.7391949Z #define UINT_MAX (INT_MAX * 2U + 1U) 2025-05-07T20:27:09.7392035Z #define _T_SIZE_ 2025-05-07T20:27:09.7392136Z #define LLONG_MAX __LONG_LONG_MAX__ 2025-05-07T20:27:09.7392268Z #define __cudaCDP2StreamCreateWithFlags 2025-05-07T20:27:09.7392369Z #define _PSTL_VERSION 12000 2025-05-07T20:27:09.7392521Z #define __noinline__ __attribute__((noinline)) 2025-05-07T20:27:09.7392640Z #define __WNOTHREAD 0x20000000 2025-05-07T20:27:09.7392744Z #define _G_va_list __gnuc_va_list 2025-05-07T20:27:09.7392880Z #define M_PI_4l 0.785398163397448309615660845819875721L 2025-05-07T20:27:09.7392967Z #define _IOS_INPUT 1 2025-05-07T20:27:09.7393069Z #define __USE_LARGEFILE64 1 2025-05-07T20:27:09.7393184Z #define _GLIBCXX_TR1_EXP_INTEGRAL_TCC 1 2025-05-07T20:27:09.7393278Z #define __INT64_TYPE__ long int 2025-05-07T20:27:09.7393377Z #define _POSIX_SSIZE_MAX 32767 2025-05-07T20:27:09.7393485Z #define __shared__ __location__(shared) 2025-05-07T20:27:09.7393580Z #define __FLT_MAX_EXP__ 128 2025-05-07T20:27:09.7393743Z #define __glibc_unlikely(cond) __builtin_expect((cond), 0) 2025-05-07T20:27:09.7393841Z #define __gid_t_defined 2025-05-07T20:27:09.7393957Z #define _GLIBCXX_USE_SC_NPROCESSORS_ONLN 1 2025-05-07T20:27:09.7394066Z #define __ORDER_BIG_ENDIAN__ 4321 2025-05-07T20:27:09.7394267Z #define __glibcxx_requires_can_increment_range(_First1,_Last1,_First2) 2025-05-07T20:27:09.7394368Z #define _GLIBCXX17_INLINE inline 2025-05-07T20:27:09.7394467Z #define __DBL_MANT_DIG__ 53 2025-05-07T20:27:09.7394556Z #define ___int_size_t_h 2025-05-07T20:27:09.7394666Z #define __FSBLKCNT64_T_TYPE __UQUAD_TYPE 2025-05-07T20:27:09.7394805Z #define __cpp_inheriting_constructors 
201511L 2025-05-07T20:27:09.7394964Z #define __WIFCONTINUED(status) ((status) == __W_CONTINUED) 2025-05-07T20:27:09.7395070Z #define CUDA_DOUBLE_MATH_FUNCTIONS 1 2025-05-07T20:27:09.7395175Z #define _GLIBCXX_HAVE_FENV_H 1 2025-05-07T20:27:09.7395275Z #define _GLIBCXX_HAVE_STDBOOL_H 1 2025-05-07T20:27:09.7395381Z #define __SIZEOF_FLOAT128__ 16 2025-05-07T20:27:09.7395512Z #define __INT_LEAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:27:09.7395629Z #define _GLIBCXX_TR1_HYPERGEOMETRIC_TCC 1 2025-05-07T20:27:09.7395758Z #define _GLIBCXX_DEBUG_PEDASSERT(_Condition) 2025-05-07T20:27:09.7395852Z #define __clock_t_defined 1 2025-05-07T20:27:09.7395953Z #define _POSIX_SEM_VALUE_MAX 32767 2025-05-07T20:27:09.7396075Z #define __cudaCDP2RuntimeGetVersion 2025-05-07T20:27:09.7396166Z #define __GLIBC_MINOR__ 17 2025-05-07T20:27:09.7396262Z #define __DEC64_MIN__ 1E-383DD 2025-05-07T20:27:09.7396371Z #define __WINT_TYPE__ unsigned int 2025-05-07T20:27:09.7396484Z #define __UINT_LEAST32_TYPE__ unsigned int 2025-05-07T20:27:09.7396585Z #define __SIZEOF_SHORT__ 2 2025-05-07T20:27:09.7396767Z #define __FLT32_NORM_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:27:09.7396852Z #define __SSE__ 1 2025-05-07T20:27:09.7396958Z #define SEM_VALUE_MAX (2147483647) 2025-05-07T20:27:09.7397058Z #define M_SQRT1_2 0.70710678118654752440 2025-05-07T20:27:09.7397143Z #define _CTYPE_H 1 2025-05-07T20:27:09.7397336Z #define __sigset_t_defined 2025-05-07T20:27:09.7397437Z #define __LDBL_MIN_EXP__ (-16381) 2025-05-07T20:27:09.7397534Z #define _GLIBCXX_HAVE_LOGF 1 2025-05-07T20:27:09.7397627Z #define MOD_TAI ADJ_TAI 2025-05-07T20:27:09.7397730Z #define _IO_va_list __gnuc_va_list 2025-05-07T20:27:09.7397826Z #define _GLIBCXX_HAVE_LOGL 1 2025-05-07T20:27:09.7397917Z #define __SM_70_RT_H__ 2025-05-07T20:27:09.7398014Z #define _GLIBCXX_HAVE_WRITEV 1 2025-05-07T20:27:09.7398131Z #define cudaEventWaitDefault 0x00 2025-05-07T20:27:09.7398225Z #define _GLIBCXX_HAVE_EXPL 1 2025-05-07T20:27:09.7398469Z #define __FLT64_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:27:09.7398572Z #define _POSIX_MAX_CANON 255 2025-05-07T20:27:09.7398683Z #define _GLIBCXX_NOEXCEPT_PARM , bool _NE 2025-05-07T20:27:09.7398778Z #define FD_SETSIZE __FD_SETSIZE 2025-05-07T20:27:09.7398879Z #define _GLIBCXX_TXN_SAFE 2025-05-07T20:27:09.7398962Z #define __amd64__ 1 2025-05-07T20:27:09.7399052Z #define __WINT_WIDTH__ 32 2025-05-07T20:27:09.7399168Z #define __CUDA_DEVICE_RUNTIME_API_H__ 2025-05-07T20:27:09.7399443Z #define __REDIRECT_NTHNL(name,proto,alias) name proto __THROWNL __asm__ (__ASMNAME (#alias)) 2025-05-07T20:27:09.7399547Z #define _GLIBCXX_STDIO_SEEK_CUR 1 2025-05-07T20:27:09.7399638Z #define EOF (-1) 2025-05-07T20:27:09.7399732Z #define __WAIT_STATUS_DEFN void * 2025-05-07T20:27:09.7399829Z #define __USE_POSIX199309 1 2025-05-07T20:27:09.7399920Z #define __INT_LEAST64_WIDTH__ 64 2025-05-07T20:27:09.7400012Z #define __LDBL_MAX_EXP__ 16384 2025-05-07T20:27:09.7400109Z #define __FLT32X_MAX_10_EXP__ 308 2025-05-07T20:27:09.7400211Z #define LLONG_MIN (-LLONG_MAX-1) 2025-05-07T20:27:09.7400321Z #define cudaSurfaceType2DLayered 0xF2 2025-05-07T20:27:09.7400417Z #define ____mbstate_t_defined 1 2025-05-07T20:27:09.7400502Z #define STA_NANO 0x2000 2025-05-07T20:27:09.7400592Z #define _GLIBCXX_HAVE_LOG10F 1 2025-05-07T20:27:09.7400692Z #define _GLIBCXX_HAVE_LOG10L 1 2025-05-07T20:27:09.7400775Z #define _IO_LINKED 0x80 2025-05-07T20:27:09.7400877Z #define __cpp_lib_launder 201606 2025-05-07T20:27:09.7400973Z #define 
__SIZEOF_INT128__ 16 2025-05-07T20:27:09.7401073Z #define __PTHREAD_MUTEX_HAVE_PREV 1 2025-05-07T20:27:09.7401175Z #define __FLT64X_IS_IEC_60559__ 2 2025-05-07T20:27:09.7401268Z #define _GLIBCXX_TYPE_TRAITS 1 2025-05-07T20:27:09.7401407Z #define cudaGraphKernelNodePortProgrammatic 1 2025-05-07T20:27:09.7401523Z #define __DEVICE_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:27:09.7401623Z #define __BLKCNT64_T_TYPE __SQUAD_TYPE 2025-05-07T20:27:09.7401721Z #define __LDBL_MAX_10_EXP__ 4932 2025-05-07T20:27:09.7401824Z #define __W_CONTINUED 0xffff 2025-05-07T20:27:09.7401910Z #define __ATOMIC_RELAXED 0 2025-05-07T20:27:09.7402039Z #define w_coredump __wait_terminated.__w_coredump 2025-05-07T20:27:09.7402167Z #define __FSBLKCNT_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:27:09.7402369Z #define __cudaCDP2OccupancyMaxActiveBlocksPerMultiprocessor 2025-05-07T20:27:09.7402564Z #define __DBL_EPSILON__ double(2.22044604925031308084726333618164062e-16L) 2025-05-07T20:27:09.7402646Z #define __stub_stty 2025-05-07T20:27:09.7402813Z #define _tolower(c) ((int) (*__ctype_tolower_loc ())[(int) (c)]) 2025-05-07T20:27:09.7402904Z #define le16toh(x) (x) 2025-05-07T20:27:09.7403009Z #define BC_SCALE_MAX _POSIX2_BC_SCALE_MAX 2025-05-07T20:27:09.7403184Z #define __FLT128_MIN__ 3.36210314311209350626267781732175260e-4932F128 2025-05-07T20:27:09.7403272Z #define _SIZET_ 2025-05-07T20:27:09.7403362Z #define XATTR_NAME_MAX 255 2025-05-07T20:27:09.7403447Z #define _SVID_SOURCE 1 2025-05-07T20:27:09.7403532Z #define _LP64 1 2025-05-07T20:27:09.7403623Z #define _LIBC_LIMITS_H_ 1 2025-05-07T20:27:09.7403859Z #define __REDIRECT_NTH_LDBL(name,proto,alias) __REDIRECT_NTH (name, proto, alias) 2025-05-07T20:27:09.7403976Z #define _GLIBCXX_TR1_BESSEL_FUNCTION_TCC 1 2025-05-07T20:27:09.7404061Z #define __UINT8_C(c) c 2025-05-07T20:27:09.7404161Z #define _GLIBCXX_HAVE_CEILF 1 2025-05-07T20:27:09.7404252Z #define _GLIBCXX_HAVE_CEILL 1 2025-05-07T20:27:09.7404451Z #define __cudaCDP2Memset3DAsync_ptsz 2025-05-07T20:27:09.7404552Z #define __CUDA_ARCH_LIST__ 520 2025-05-07T20:27:09.7404643Z #define __FLT64_MAX_EXP__ 1024 2025-05-07T20:27:09.7404740Z #define MOD_MAXERROR ADJ_MAXERROR 2025-05-07T20:27:09.7404828Z #define CUDARTAPI 2025-05-07T20:27:09.7404909Z #define IOV_MAX 1024 2025-05-07T20:27:09.7405070Z #define __glibcxx_requires_irreflexive2(_First,_Last) 2025-05-07T20:27:09.7405201Z #define __INT_LEAST32_TYPE__ int 2025-05-07T20:27:09.7405751Z #define cudaMemAttachSingle 0x04 2025-05-07T20:27:09.7405870Z #define __wchar_t__ 2025-05-07T20:27:09.7406099Z #define __cpp_lib_is_aggregate 201703 2025-05-07T20:27:09.7406183Z #define SEEK_END 2 2025-05-07T20:27:09.7406282Z #define __SIZEOF_WCHAR_T__ 4 2025-05-07T20:27:09.7406457Z #define _GLIBCXX_USE_TBB_PAR_BACKEND __has_include() 2025-05-07T20:27:09.7406556Z #define _IO_ftrylockfile(_fp) 2025-05-07T20:27:09.7406707Z #define _GLIBCXX_USE_C99_WCHAR _GLIBCXX11_USE_C99_WCHAR 2025-05-07T20:27:09.7406801Z #define ____FILE_defined 1 2025-05-07T20:27:09.7406918Z #define _GLIBCXX_HAVE_BUILTIN_IS_AGGREGATE 1 2025-05-07T20:27:09.7407019Z #define __GNUC_PATCHLEVEL__ 0 2025-05-07T20:27:09.7407105Z #define _ISOC99_SOURCE 1 2025-05-07T20:27:09.7407198Z #define __VECTOR_FUNCTIONS_H__ 2025-05-07T20:27:09.7407468Z #define __REDIRECT_NTH(name,proto,alias) name proto __THROW __asm__ (__ASMNAME (#alias)) 2025-05-07T20:27:09.7407600Z #define _PSTL_USE_NONTEMPORAL_STORES_IF_ALLOWED 2025-05-07T20:27:09.7407692Z #define _IO_RIGHT 04 2025-05-07T20:27:09.7407783Z #define __END_NAMESPACE_STD 2025-05-07T20:27:09.7407971Z 
#define __FLT128_NORM_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:27:09.7408074Z #define _GLIBCXX_STD_C std 2025-05-07T20:27:09.7408192Z #define cudaInitDeviceFlagsAreValid 0x01 2025-05-07T20:27:09.7408287Z #define _LARGEFILE64_SOURCE 1 2025-05-07T20:27:09.7408397Z #define _GLIBCXX_USE_C99_STDINT_TR1 1 2025-05-07T20:27:09.7408476Z #define _STDDEF_H_ 2025-05-07T20:27:09.7408654Z #define __FLT64_NORM_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:27:09.7408757Z #define __FLT128_HAS_QUIET_NAN__ 1 2025-05-07T20:27:09.7408874Z #define isalnum_l(c,l) __isalnum_l ((c), (l)) 2025-05-07T20:27:09.7409080Z #define __FD_ISSET(d,set) ((__FDS_BITS (set)[__FD_ELT (d)] & __FD_MASK (d)) != 0) 2025-05-07T20:27:09.7409191Z #define __INTMAX_MAX__ 0x7fffffffffffffffL 2025-05-07T20:27:09.7409334Z #define __glibcxx_requires_irreflexive(_First,_Last) 2025-05-07T20:27:09.7409461Z #define cudaGraphKernelNodePortDefault 0 2025-05-07T20:27:09.7409562Z #define __INT_FAST8_TYPE__ signed char 2025-05-07T20:27:09.7409675Z #define __cudaCDP2Memcpy3DAsync_ptsz 2025-05-07T20:27:09.7409777Z #define __PID_T_TYPE __S32_TYPE 2025-05-07T20:27:09.7409887Z #define __cpp_namespace_attributes 201411L 2025-05-07T20:27:09.7409983Z #define CHARCLASS_NAME_MAX 2048 2025-05-07T20:27:09.7410080Z #define _GLIBCXX_HAVE_TANF 1 2025-05-07T20:27:09.7410174Z #define _GLIBCXX_USE_ST_MTIM 1 2025-05-07T20:27:09.7410356Z #define __FLT64X_MIN__ 3.36210314311209350626267781732175260e-4932F64x 2025-05-07T20:27:09.7410445Z #define __CUDA_RUNTIME_H__ 2025-05-07T20:27:09.7410625Z #define WIFSIGNALED(status) __WIFSIGNALED (__WAIT_INT (status)) 2025-05-07T20:27:09.7410726Z #define _GLIBCXX_HAVE_STDLIB_H 1 2025-05-07T20:27:09.7410817Z #define __STDCPP_THREADS__ 1 2025-05-07T20:27:09.7410960Z #define M_2_SQRTPIl 1.128379167095512573896158903121545172L 2025-05-07T20:27:09.7411058Z #define __GNUC_STDC_INLINE__ 1 2025-05-07T20:27:09.7411149Z #define _POSIX_UIO_MAXIOV 16 2025-05-07T20:27:09.7411247Z #define _PSTL_PAR_BACKEND_SERIAL 2025-05-07T20:27:09.7411352Z #define P_tmpdir "/tmp" 2025-05-07T20:27:09.7411470Z #define __ASSERT_FUNCTION __PRETTY_FUNCTION__ 2025-05-07T20:27:09.7411562Z #define __FLT64_HAS_DENORM__ 1 2025-05-07T20:27:09.7411666Z #define __WORDSIZE_TIME64_COMPAT32 1 2025-05-07T20:27:09.7411830Z #define _GLIBCXX_DEPRECATED __attribute__ ((__deprecated__)) 2025-05-07T20:27:09.7412006Z #define __FLT32_EPSILON__ 1.19209289550781250000000000000000000e-7F32 2025-05-07T20:27:09.7412196Z #define _PSTL_HIDE_FROM_ABI_PUSH 2025-05-07T20:27:09.7412319Z #define cudaStreamLegacy ((cudaStream_t)0x1) 2025-05-07T20:27:09.7412438Z #define _IO_cleanup_region_start(_fct,_fp) 2025-05-07T20:27:09.7412539Z #define __location__(a) __annotate__(a) 2025-05-07T20:27:09.7412770Z #define __device_builtin_surface_type__ __location__(device_builtin_surface_type) 2025-05-07T20:27:09.7412874Z #define _POSIX2_BC_BASE_MAX 99 2025-05-07T20:27:09.7412985Z #define __cudaCDP2DeviceGetAttribute 2025-05-07T20:27:09.7413077Z #define __DBL_DECIMAL_DIG__ 17 2025-05-07T20:27:09.7413255Z #define __STDC_UTF_32__ 1 2025-05-07T20:27:09.7413347Z #define __INT_FAST8_WIDTH__ 8 2025-05-07T20:27:09.7413449Z #define NAN (__builtin_nanf ("")) 2025-05-07T20:27:09.7413543Z #define _POSIX_MQ_PRIO_MAX 32 2025-05-07T20:27:09.7413623Z #define __FXSR__ 1 2025-05-07T20:27:09.7413707Z #define _SIZE_T 2025-05-07T20:27:09.7413807Z #define _GLIBCXX_USE_GETTIMEOFDAY 1 2025-05-07T20:27:09.7413924Z #define cudaHostRegisterReadOnly 0x08 2025-05-07T20:27:09.7414099Z #define 
__FLT32X_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:27:09.7414248Z #define __WIFSTOPPED(status) (((status) & 0xff) == 0x7f) 2025-05-07T20:27:09.7414339Z #define _IO_ssize_t __ssize_t 2025-05-07T20:27:09.7414444Z #define __ULONG32_TYPE unsigned int 2025-05-07T20:27:09.7414741Z #define __DBL_NORM_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:27:09.7414952Z #define cudaStreamGraphTailLaunch (cudaStream_t)0x0100000000000000 2025-05-07T20:27:09.7415042Z #define _GXX_NULLPTR_T 2025-05-07T20:27:09.7415171Z #define __glibcxx_class_requires3(_a,_b,_c,_d) 2025-05-07T20:27:09.7415266Z #define FOPEN_MAX 16 2025-05-07T20:27:09.7415355Z #define __BIG_ENDIAN 4321 2025-05-07T20:27:09.7415473Z #define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:27:09.7415577Z #define __suseconds_t_defined 2025-05-07T20:27:09.7415668Z #define __off_t_defined 2025-05-07T20:27:09.7415751Z #define stderr stderr 2025-05-07T20:27:09.7415858Z #define M_LOG10E 0.43429448190325182765 2025-05-07T20:27:09.7415970Z #define __glibcxx_requires_string(_String) 2025-05-07T20:27:09.7416066Z #define _GLIBCXX_HAVE_LDEXPL 1 2025-05-07T20:27:09.7417939Z #define __INTMAX_WIDTH__ 64 2025-05-07T20:27:09.7418348Z #define _PSTL_CPP14_2RANGE_MISMATCH_EQUAL_PRESENT (_MSC_VER >= 1900 || __cplusplus >= 201300L || __cpp_lib_robust_nonmodifying_seq_ops == 201304) 2025-05-07T20:27:09.7418443Z #define __mode_t_defined 2025-05-07T20:27:09.7418524Z #define _GCC_SIZE_T 2025-05-07T20:27:09.7418621Z #define __INO64_T_TYPE __UQUAD_TYPE 2025-05-07T20:27:09.7418732Z #define __cpp_runtime_arrays 198712L 2025-05-07T20:27:09.7418837Z #define __UINT64_TYPE__ long unsigned int 2025-05-07T20:27:09.7418929Z #define __USE_XOPEN2K8XSI 1 2025-05-07T20:27:09.7419027Z #define __UINT32_C(c) c ## U 2025-05-07T20:27:09.7419132Z #define __cpp_alias_templates 200704L 2025-05-07T20:27:09.7419236Z #define cudaHostAllocMapped 0x02 2025-05-07T20:27:09.7419349Z #define __DEVICE_LAUNCH_PARAMETERS_H__ 2025-05-07T20:27:09.7419438Z #define _STL_ITERATOR_H 1 2025-05-07T20:27:09.7419528Z #define __size_t__ 2025-05-07T20:27:09.7419659Z #define cudaStreamAttrID cudaLaunchAttributeID 2025-05-07T20:27:09.7419754Z #define _GLIBCXX_HAVE_ATANF 1 2025-05-07T20:27:09.7419870Z #define cudaEventRecordExternal 0x01 2025-05-07T20:27:09.7420020Z #define __isspace_l(c,l) __isctype_l((c), _ISspace, (l)) 2025-05-07T20:27:09.7420111Z #define _IO_BUFSIZ _G_BUFSIZ 2025-05-07T20:27:09.7420300Z #define __FLT_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F 2025-05-07T20:27:09.7420385Z #define _ENDIAN_H 1 2025-05-07T20:27:09.7420504Z #define __builtin_align__(a) __align__(a) 2025-05-07T20:27:09.7420599Z #define _GLIBCXX20_CONSTEXPR 2025-05-07T20:27:09.7420700Z #define __NV_NO_HOST_COMPILER_CHECK 1 2025-05-07T20:27:09.7420788Z #define __try try 2025-05-07T20:27:09.7420888Z #define _GLIBCXX_HAVE_FINITE 1 2025-05-07T20:27:09.7420983Z #define __FLT128_IS_IEC_60559__ 2 2025-05-07T20:27:09.7421077Z #define __INT8_MAX__ 0x7f 2025-05-07T20:27:09.7421466Z #define cudaStreamGetCaptureInfo __CUDART_API_PTSZ(cudaStreamGetCaptureInfo_v2) 2025-05-07T20:27:09.7421556Z #define __LONG_WIDTH__ 64 2025-05-07T20:27:09.7421642Z #define __PIC__ 2 2025-05-07T20:27:09.7421758Z #define BC_STRING_MAX _POSIX2_BC_STRING_MAX 2025-05-07T20:27:09.7421877Z #define __UINT_FAST32_TYPE__ long unsigned int 2025-05-07T20:27:09.7422014Z #define FD_ISSET(fd,fdsetp) __FD_ISSET (fd, fdsetp) 2025-05-07T20:27:09.7422111Z #define _GLIBCXX_HAVE_FLOAT_H 1 2025-05-07T20:27:09.7422210Z #define 
_GLIBCXX_HAVE_ATANL 1 2025-05-07T20:27:09.7422480Z #define __FLT32X_NORM_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:27:09.7422606Z #define __DEVICE_FUNCTIONS_HPP__ 2025-05-07T20:27:09.7422721Z #define __CHAR32_TYPE__ unsigned int 2025-05-07T20:27:09.7422824Z #define _IO_uid_t __uid_t 2025-05-07T20:27:09.7422921Z #define _GLIBCXX_HAVE_READLINK 1 2025-05-07T20:27:09.7423055Z #define __cudaCDP2EventRecordWithFlags_ptsz 2025-05-07T20:27:09.7423152Z #define _CONCEPT_CHECK_H 1 2025-05-07T20:27:09.7423298Z #define __FLT_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:27:09.7423407Z #define _GLIBCXX_HAVE_NETINET_IN_H 1 2025-05-07T20:27:09.7423535Z #define _GLIBCXX_TR1_SPECIAL_FUNCTION_UTIL_H 1 2025-05-07T20:27:09.7423627Z #define LONG_BIT 64 2025-05-07T20:27:09.7423735Z #define __SIZEOF_PTHREAD_BARRIERATTR_T 4 2025-05-07T20:27:09.7423835Z #define _GLIBCXX_USE_ALLOCATOR_NEW 1 2025-05-07T20:27:09.7423971Z #define __cpp_lib_math_special_functions 201603L 2025-05-07T20:27:09.7424065Z #define __fsfilcnt_t_defined 2025-05-07T20:27:09.7424160Z #define __blkcnt_t_defined 2025-05-07T20:27:09.7424442Z #define cudaKernelNodeAttributeMemSyncDomain cudaLaunchAttributeMemSyncDomain 2025-05-07T20:27:09.7424533Z #define __USE_LARGEFILE 1 2025-05-07T20:27:09.7424630Z #define __cpp_constexpr 201603L 2025-05-07T20:27:09.7424731Z #define CUDART_VERSION 12060 2025-05-07T20:27:09.7424819Z #define NL_TEXTMAX INT_MAX 2025-05-07T20:27:09.7424924Z #define cudaDeviceMapHost 0x08 2025-05-07T20:27:09.7425017Z #define _GLIBCXX_CMATH 1 2025-05-07T20:27:09.7425215Z #define __attribute_format_arg__(x) __attribute__ ((__format_arg__ (x))) 2025-05-07T20:27:09.7425313Z #define __lldiv_t_defined 1 2025-05-07T20:27:09.7425641Z #define __SSE2__ 1 2025-05-07T20:27:09.7425781Z #define _IOLBF 1 2025-05-07T20:27:09.7425903Z #define _GLIBCXX_HAVE_SYS_TYPES_H 1 2025-05-07T20:27:09.7425999Z #define _GLIBCXX_HAVE_FLOORF 1 2025-05-07T20:27:09.7426105Z #define __cpp_deduction_guides 201703L 2025-05-07T20:27:09.7426205Z #define _GLIBCXX_HAVE_EXPF 1 2025-05-07T20:27:09.7426323Z #define __annotate__(a) __attribute__((a)) 2025-05-07T20:27:09.7426411Z #define __INT32_TYPE__ int 2025-05-07T20:27:09.7426506Z #define __SIZEOF_DOUBLE__ 8 2025-05-07T20:27:09.7426612Z #define cudaDeviceSyncMemops 0x80 2025-05-07T20:27:09.7426712Z #define __cpp_exceptions 199711L 2025-05-07T20:27:09.7426820Z #define __FLT_MIN_10_EXP__ (-37) 2025-05-07T20:27:09.7426931Z #define cudaDeviceScheduleYield 0x02 2025-05-07T20:27:09.7427035Z #define _SYS_SYSMACROS_H 1 2025-05-07T20:27:09.7427151Z #define _GLIBCXX_TR1_LEGENDRE_FUNCTION_TCC 1 2025-05-07T20:27:09.7427315Z #define __FLT64_MIN__ 2.22507385850720138309023271733240406e-308F64 2025-05-07T20:27:09.7427421Z #define __INT_LEAST32_WIDTH__ 32 2025-05-07T20:27:09.7427519Z #define __SWORD_TYPE long int 2025-05-07T20:27:09.7427613Z #define __INTMAX_TYPE__ long int 2025-05-07T20:27:09.7427719Z #define _GLIBCXX11_USE_C99_MATH 1 2025-05-07T20:27:09.7427813Z #define __PTHREAD_SPINS 0, 0 2025-05-07T20:27:09.7427904Z #define _BITS_POSIX1_LIM_H 1 2025-05-07T20:27:09.7428200Z #define cudaStreamAttributeMemSyncDomainMap cudaLaunchAttributeMemSyncDomainMap 2025-05-07T20:27:09.7428295Z #define __DEC128_MAX_EXP__ 6145 2025-05-07T20:27:09.7428453Z #define math_errhandling (MATH_ERRNO | MATH_ERREXCEPT) 2025-05-07T20:27:09.7428535Z #define _T_SIZE 2025-05-07T20:27:09.7428640Z #define cudaHostAllocDefault 0x00 2025-05-07T20:27:09.7428773Z #define _PSTL_PRAGMA_SIMD_EXCLUSIVE_SCAN(PRM) 
2025-05-07T20:27:09.7429183Z #define __va_arg_pack() __builtin_va_arg_pack () 2025-05-07T20:27:09.7429282Z #define _POSIX_TIMER_MAX 32 2025-05-07T20:27:09.7429382Z #define _GLIBCXX_HAVE_TLS 1 2025-05-07T20:27:09.7429503Z #define _GLIBCXX_NOTHROW _GLIBCXX_USE_NOEXCEPT 2025-05-07T20:27:09.7429602Z #define _GLIBCXX_HAVE_ACOSL 1 2025-05-07T20:27:09.7429706Z #define __FLT32X_HAS_QUIET_NAN__ 1 2025-05-07T20:27:09.7429795Z #define __ATOMIC_CONSUME 1 2025-05-07T20:27:09.7429979Z #define __CUDA_ARCH_HAS_FEATURE__(_FEAT) __CUDA_ARCH_FEAT_ ##_FEAT 2025-05-07T20:27:09.7430067Z #define __GNUC_MINOR__ 4 2025-05-07T20:27:09.7430307Z #define __GLIBCXX_TYPE_INT_N_0 __int128 2025-05-07T20:27:09.7430408Z #define __INT_FAST16_WIDTH__ 64 2025-05-07T20:27:09.7430525Z #define __UINTMAX_MAX__ 0xffffffffffffffffUL 2025-05-07T20:27:09.7430611Z #define __PIE__ 2 2025-05-07T20:27:09.7430720Z #define LITTLE_ENDIAN __LITTLE_ENDIAN 2025-05-07T20:27:09.7430821Z #define _GLIBCXX_HAVE_INT64_T_LONG 1 2025-05-07T20:27:09.7431019Z #define __FLT32X_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F32x 2025-05-07T20:27:09.7431252Z #define __intN_t(N,MODE) typedef int int ##N ##_t __attribute__ ((__mode__ (MODE))) 2025-05-07T20:27:09.7431343Z #define __nlink_t_defined 2025-05-07T20:27:09.7431472Z #define _GLIBCXX17_DEPRECATED [[__deprecated__]] 2025-05-07T20:27:09.7431593Z #define _PSTL_STRING(x) _PSTL_STRING_AUX(x) 2025-05-07T20:27:09.7431679Z #define _XOPEN_LIM_H 1 2025-05-07T20:27:09.7431955Z #define __u_intN_t(N,MODE) typedef unsigned int u_int ##N ##_t __attribute__ ((__mode__ (MODE))) 2025-05-07T20:27:09.7432075Z #define __cpp_template_template_args 201611L 2025-05-07T20:27:09.7432186Z #define _GTHREAD_USE_MUTEX_TIMEDLOCK 1 2025-05-07T20:27:09.7432299Z #define BC_DIM_MAX _POSIX2_BC_DIM_MAX 2025-05-07T20:27:09.7432406Z #define __DBL_MAX_10_EXP__ 308 2025-05-07T20:27:09.7432507Z #define __FILE_defined 1 2025-05-07T20:27:09.7432718Z #define __LDBL_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951L 2025-05-07T20:27:09.7432816Z #define _GLIBCXX_HAVE_SINCOS 1 2025-05-07T20:27:09.7432917Z #define __USE_XOPEN_EXTENDED 1 2025-05-07T20:27:09.7433033Z #define __cpp_lib_tuple_element_t 201402L 2025-05-07T20:27:09.7433153Z #define isascii_l(c,l) __isascii_l ((c), (l)) 2025-05-07T20:27:09.7433271Z #define cudaInvalidDeviceId ((int)-2) 2025-05-07T20:27:09.7433374Z #define _GLIBCXX_HAVE_SYS_RESOURCE_H 1 2025-05-07T20:27:09.7433460Z #define __INT16_C(c) c 2025-05-07T20:27:09.7433565Z #define __U32_TYPE unsigned int 2025-05-07T20:27:09.7433665Z #define _GLIBCXX_HAVE_SYS_IOCTL_H 1 2025-05-07T20:27:09.7433789Z #define FD_CLR(fd,fdsetp) __FD_CLR (fd, fdsetp) 2025-05-07T20:27:09.7433883Z #define __STDC__ 1 2025-05-07T20:27:09.7433980Z #define _GLIBCXX_HAVE_VWSCANF 1 2025-05-07T20:27:09.7434080Z #define _GLIBCXX_HAVE_EXECINFO_H 1 2025-05-07T20:27:09.7434185Z #define _GLIBCXX_USE_REALPATH 1 2025-05-07T20:27:09.7434339Z #define __attribute_malloc__ __attribute__ ((__malloc__)) 2025-05-07T20:27:09.7434436Z #define __FLT32X_DIG__ 15 2025-05-07T20:27:09.7434542Z #define _GLIBCXX_USE_C99_CTYPE_TR1 1 2025-05-07T20:27:09.7434639Z #define __PTRDIFF_TYPE__ long int 2025-05-07T20:27:09.7434761Z #define cudaArrayDeferredMapping 0x80 2025-05-07T20:27:09.7434875Z #define _GLIBCXX_END_NAMESPACE_CONTAINER 2025-05-07T20:27:09.7434974Z #define USHRT_MAX (SHRT_MAX * 2 + 1) 2025-05-07T20:27:09.7435087Z #define __cpp_lib_is_swappable 201603 2025-05-07T20:27:09.7435173Z #define stdin stdin 2025-05-07T20:27:09.7435263Z #define __ino64_t_defined 
[... remainder of the preprocessor #define dump elided: several thousand macro definitions emitted by the toolchain headers ...]
2025-05-07T20:27:09.7693452Z + conda run -n build_binary nvcc --version
2025-05-07T20:27:11.6739684Z nvcc: NVIDIA (R) Cuda compiler driver
2025-05-07T20:27:11.6740067Z Copyright (c) 2005-2024 NVIDIA Corporation
2025-05-07T20:27:11.6740393Z Built on Tue_Oct_29_23:50:19_PDT_2024
2025-05-07T20:27:11.6740708Z Cuda compilation tools, release 12.6, V12.6.85
2025-05-07T20:27:11.6741051Z Build cuda_12.6.r12.6/compiler.35059454_0
2025-05-07T20:27:11.7459726Z /usr/bin/nvidia-smi
2025-05-07T20:27:11.7464889Z + nvidia-smi
2025-05-07T20:27:11.7641046Z Wed May 7 20:27:11 2025
2025-05-07T20:27:11.7641438Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:27:11.7641954Z | NVIDIA-SMI 570.133.07 Driver Version: 570.133.07 CUDA Version: 12.8 |
2025-05-07T20:27:11.7642460Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:27:11.7642960Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:27:11.7643538Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
2025-05-07T20:27:11.7643981Z | | | MIG M. |
2025-05-07T20:27:11.7644318Z |=========================================+========================+======================|
2025-05-07T20:27:11.7811386Z | 0 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 |
2025-05-07T20:27:11.7811874Z | 0% 27C P8 16W / 300W | 0MiB / 23028MiB | 0% Default |
2025-05-07T20:27:11.7812272Z | | | N/A |
2025-05-07T20:27:11.7812677Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:27:11.7816458Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:27:11.7816900Z | Processes: |
2025-05-07T20:27:11.7817354Z | GPU GI CI PID Type Process name GPU Memory |
2025-05-07T20:27:11.7817782Z | ID ID Usage |
2025-05-07T20:27:11.7818136Z |=========================================================================================|
2025-05-07T20:27:11.7822390Z | No running processes found |
2025-05-07T20:27:11.7822972Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:27:12.0537937Z [INSTALL] Successfully installed CUDA 12.6.3
2025-05-07T20:27:12.0599120Z ##[group]Run . $PRELUDE; install_pytorch_pip $BUILD_ENV nightly cuda/12.6.3
2025-05-07T20:27:12.0599706Z . 
$PRELUDE; install_pytorch_pip $BUILD_ENV nightly cuda/12.6.3 2025-05-07T20:27:12.0611722Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:27:12.0612088Z env: 2025-05-07T20:27:12.0612326Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:27:12.0612637Z BUILD_ENV: build_binary 2025-05-07T20:27:12.0612902Z BUILD_TARGET: genai 2025-05-07T20:27:12.0613182Z BUILD_VARIANT: cuda 2025-05-07T20:27:12.0613437Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:27:12.0613700Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:27:12.0614013Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:27:12.0614364Z ##[endgroup] 2025-05-07T20:27:12.4027960Z ################################################################################ 2025-05-07T20:27:12.4028510Z # Install PyTorch (PIP) 2025-05-07T20:27:12.4028804Z # 2025-05-07T20:27:12.4044457Z # [2025-05-07T20:27:12.404Z] + install_pytorch_pip build_binary nightly cuda/12.6.3 2025-05-07T20:27:12.4044976Z ################################################################################ 2025-05-07T20:27:12.4045230Z 2025-05-07T20:27:12.4074321Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y numpy 2025-05-07T20:27:13.3965994Z Channels: 2025-05-07T20:27:13.3966388Z - conda-forge 2025-05-07T20:27:13.3966753Z Platform: linux-64 2025-05-07T20:27:16.8609194Z Collecting package metadata (repodata.json): - \ | / done 2025-05-07T20:27:17.5989106Z Solving environment: \ | / done 2025-05-07T20:27:17.8163379Z 2025-05-07T20:27:17.8163685Z ## Package Plan ## 2025-05-07T20:27:17.8163927Z 2025-05-07T20:27:17.8164209Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:27:17.8164673Z 2025-05-07T20:27:17.8164833Z added / updated specs: 2025-05-07T20:27:17.8165084Z - numpy 2025-05-07T20:27:17.8165201Z 2025-05-07T20:27:17.8165229Z 2025-05-07T20:27:17.8165352Z The following packages will be downloaded: 2025-05-07T20:27:17.8165587Z 2025-05-07T20:27:17.8165701Z package | build 2025-05-07T20:27:17.8166035Z ---------------------------|----------------- 2025-05-07T20:27:17.8166417Z libblas-3.9.0 |31_h59b9bed_openblas 16 KB conda-forge 2025-05-07T20:27:17.8166878Z libcblas-3.9.0 |31_he106b2a_openblas 16 KB conda-forge 2025-05-07T20:27:17.8167408Z libgfortran-15.1.0 | h69a702a_2 34 KB conda-forge 2025-05-07T20:27:17.8168081Z libgfortran5-15.1.0 | hcea5267_2 1.5 MB conda-forge 2025-05-07T20:27:17.8168729Z liblapack-3.9.0 |31_h7ac8fdf_openblas 16 KB conda-forge 2025-05-07T20:27:17.8169214Z libopenblas-0.3.29 |pthreads_h94d23a6_0 5.6 MB conda-forge 2025-05-07T20:27:17.8169677Z numpy-2.2.5 | py312h72c5963_0 8.1 MB conda-forge 2025-05-07T20:27:17.8170071Z ------------------------------------------------------------ 2025-05-07T20:27:17.8170723Z Total: 15.4 MB 2025-05-07T20:27:17.8170945Z 2025-05-07T20:27:17.8171072Z The following NEW packages will be INSTALLED: 2025-05-07T20:27:17.8171302Z 2025-05-07T20:27:17.8171526Z libblas conda-forge/linux-64::libblas-3.9.0-31_h59b9bed_openblas 2025-05-07T20:27:17.8172032Z libcblas conda-forge/linux-64::libcblas-3.9.0-31_he106b2a_openblas 2025-05-07T20:27:17.8172555Z libgfortran conda-forge/linux-64::libgfortran-15.1.0-h69a702a_2 2025-05-07T20:27:17.8173073Z libgfortran5 conda-forge/linux-64::libgfortran5-15.1.0-hcea5267_2 2025-05-07T20:27:17.8173608Z liblapack conda-forge/linux-64::liblapack-3.9.0-31_h7ac8fdf_openblas 2025-05-07T20:27:17.8174157Z libopenblas conda-forge/linux-64::libopenblas-0.3.29-pthreads_h94d23a6_0 2025-05-07T20:27:17.8175018Z numpy 
conda-forge/linux-64::numpy-2.2.5-py312h72c5963_0
2025-05-07T20:27:17.8175462Z Downloading and Extracting Packages: ...working...
[... interleaved per-package progress bars and terminal control sequences elided; all seven packages downloaded to 100% ...]
2025-05-07T20:27:18.7067540Z done
2025-05-07T20:27:18.8069844Z Preparing transaction: done
2025-05-07T20:27:19.0077997Z Verifying transaction: done
2025-05-07T20:27:19.1089411Z Executing transaction: done
2025-05-07T20:27:19.2947015Z ################################################################################
2025-05-07T20:27:19.2947433Z # Install Package From PyTorch PIP: torch
2025-05-07T20:27:19.2947747Z #
2025-05-07T20:27:19.2962787Z # [2025-05-07T20:27:19.295Z] + install_from_pytorch_pip build_binary torch nightly cuda/12.6.3
2025-05-07T20:27:19.2963296Z ################################################################################
2025-05-07T20:27:19.2978610Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:27:19.3873485Z [CHECK] Network does not appear to be blocked. 
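
[NOTE] The step below resolves the requested channel (nightly) and variant (cuda/12.6.3 -> cu126) into pip arguments before installing. A minimal standalone sketch of the same install, assuming a conda environment named build_binary as in this job, is:

    # Install the latest PyTorch nightly wheel built against CUDA 12.6,
    # using the same index URL and flags recorded later in this log.
    conda run -n build_binary pip install --pre torch \
        --index-url https://download.pytorch.org/whl/nightly/cu126/
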
2025-05-07T20:27:19.3873843Z ################################################################################ 2025-05-07T20:27:19.3874179Z # Prepare PIP Arguments (PyTorch PIP) 2025-05-07T20:27:19.3874463Z # 2025-05-07T20:27:19.3894096Z # [2025-05-07T20:27:19.388Z] + __prepare_pip_arguments torch nightly cuda/12.6.3 2025-05-07T20:27:19.3894871Z ################################################################################ 2025-05-07T20:27:19.3895189Z 2025-05-07T20:27:19.3918297Z [INSTALL] Extracted package (channel, version): (nightly, LATEST) 2025-05-07T20:27:19.3944069Z [INSTALL] Extracted package variant: cu126 2025-05-07T20:27:19.3961677Z [INSTALL] Using a non-RELEASE channel: nightly ... 2025-05-07T20:27:19.3962225Z [INSTALL] Extracted the full PIP channel: https://download.pytorch.org/whl/nightly/cu126/ 2025-05-07T20:27:19.3970715Z [INSTALL] Extracted the full PIP package: --pre torch 2025-05-07T20:27:19.3979707Z [INSTALL] Attempting to install [torch, LATEST] from PyTorch PIP using channel https://download.pytorch.org/whl/nightly/cu126/ ... 2025-05-07T20:27:19.4000658Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu126/ 2025-05-07T20:29:06.3999957Z Looking in indexes: https://download.pytorch.org/whl/nightly/cu126/ 2025-05-07T20:29:06.4000578Z Collecting torch 2025-05-07T20:29:06.4001509Z Downloading https://download.pytorch.org/whl/nightly/cu126/torch-2.8.0.dev20250507%2Bcu126-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (30 kB) 2025-05-07T20:29:06.4002530Z Collecting filelock (from torch) 2025-05-07T20:29:06.4003230Z Downloading https://download.pytorch.org/whl/nightly/filelock-3.16.1-py3-none-any.whl (16 kB) 2025-05-07T20:29:06.4004575Z Requirement already satisfied: typing-extensions>=4.10.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from torch) (4.13.2) 2025-05-07T20:29:06.4005705Z Requirement already satisfied: setuptools in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from torch) (78.1.1) 2025-05-07T20:29:06.4006405Z Collecting sympy>=1.13.3 (from torch) 2025-05-07T20:29:06.4006934Z Downloading https://download.pytorch.org/whl/nightly/sympy-1.13.3-py3-none-any.whl (6.2 MB) 2025-05-07T20:29:06.4007795Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.2/6.2 MB 42.3 MB/s eta 0:00:00 2025-05-07T20:29:06.4008153Z Collecting networkx (from torch) 2025-05-07T20:29:06.4008663Z Downloading https://download.pytorch.org/whl/nightly/networkx-3.4.2-py3-none-any.whl (1.7 MB) 2025-05-07T20:29:06.4009317Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.7/1.7 MB 24.1 MB/s eta 0:00:00 2025-05-07T20:29:06.4009665Z Collecting jinja2 (from torch) 2025-05-07T20:29:06.4010151Z Downloading https://download.pytorch.org/whl/nightly/jinja2-3.1.4-py3-none-any.whl (133 kB) 2025-05-07T20:29:06.4010667Z Collecting fsspec (from torch) 2025-05-07T20:29:06.4011160Z Downloading https://download.pytorch.org/whl/nightly/fsspec-2024.10.0-py3-none-any.whl (179 kB) 2025-05-07T20:29:06.4011751Z Collecting nvidia-cuda-nvrtc-cu12==12.6.77 (from torch) 2025-05-07T20:29:06.4012489Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cuda_nvrtc_cu12-12.6.77-py3-none-manylinux2014_x86_64.whl (23.7 MB) 2025-05-07T20:29:06.4013308Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23.7/23.7 MB 53.1 MB/s eta 0:00:00 2025-05-07T20:29:06.4014220Z Collecting nvidia-cuda-runtime-cu12==12.6.77 (from torch) 2025-05-07T20:29:06.4015111Z Downloading 
https://download.pytorch.org/whl/nightly/cu126/nvidia_cuda_runtime_cu12-12.6.77-py3-none-manylinux2014_x86_64.whl (897 kB) 2025-05-07T20:29:06.4015939Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 897.7/897.7 kB 10.3 MB/s eta 0:00:00 2025-05-07T20:29:06.4016437Z Collecting nvidia-cuda-cupti-cu12==12.6.80 (from torch) 2025-05-07T20:29:06.4017263Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cuda_cupti_cu12-12.6.80-py3-none-manylinux2014_x86_64.whl (8.9 MB) 2025-05-07T20:29:06.4018128Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8.9/8.9 MB 41.5 MB/s eta 0:00:00 2025-05-07T20:29:06.4018605Z Collecting nvidia-cudnn-cu12==9.5.1.17 (from torch) 2025-05-07T20:29:06.4019540Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cudnn_cu12-9.5.1.17-py3-none-manylinux_2_28_x86_64.whl (571.0 MB) 2025-05-07T20:29:06.4020375Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 571.0/571.0 MB 37.3 MB/s eta 0:00:00 2025-05-07T20:29:06.4020774Z Collecting nvidia-cublas-cu12==12.6.4.1 (from torch) 2025-05-07T20:29:06.4021576Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cublas_cu12-12.6.4.1-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (393.1 MB) 2025-05-07T20:29:06.4022465Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 393.1/393.1 MB 44.6 MB/s eta 0:00:00 2025-05-07T20:29:06.4022859Z Collecting nvidia-cufft-cu12==11.3.0.4 (from torch) 2025-05-07T20:29:06.4023563Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cufft_cu12-11.3.0.4-py3-none-manylinux2014_x86_64.whl (200.2 MB) 2025-05-07T20:29:06.4024349Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 200.2/200.2 MB 141.8 MB/s eta 0:00:00 2025-05-07T20:29:06.4024744Z Collecting nvidia-curand-cu12==10.3.7.77 (from torch) 2025-05-07T20:29:06.4025799Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_curand_cu12-10.3.7.77-py3-none-manylinux2014_x86_64.whl (56.3 MB) 2025-05-07T20:29:06.4026614Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 56.3/56.3 MB 209.1 MB/s eta 0:00:00 2025-05-07T20:29:06.4027012Z Collecting nvidia-cusolver-cu12==11.7.1.2 (from torch) 2025-05-07T20:29:06.4027739Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cusolver_cu12-11.7.1.2-py3-none-manylinux2014_x86_64.whl (158.2 MB) 2025-05-07T20:29:06.4028549Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 158.2/158.2 MB 148.2 MB/s eta 0:00:00 2025-05-07T20:29:06.4028948Z Collecting nvidia-cusparse-cu12==12.5.4.2 (from torch) 2025-05-07T20:29:06.4029670Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cusparse_cu12-12.5.4.2-py3-none-manylinux2014_x86_64.whl (216.6 MB) 2025-05-07T20:29:06.4030481Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 216.6/216.6 MB 112.6 MB/s eta 0:00:00 2025-05-07T20:29:06.4030883Z Collecting nvidia-cusparselt-cu12==0.6.3 (from torch) 2025-05-07T20:29:06.4031618Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cusparselt_cu12-0.6.3-py3-none-manylinux2014_x86_64.whl (156.8 MB) 2025-05-07T20:29:06.4032432Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 156.8/156.8 MB 162.0 MB/s eta 0:00:00 2025-05-07T20:29:06.4032808Z Collecting nvidia-nccl-cu12==2.26.2 (from torch) 2025-05-07T20:29:06.4033607Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_nccl_cu12-2.26.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (2.0 kB) 2025-05-07T20:29:06.4034405Z Collecting nvidia-nvtx-cu12==12.6.77 (from torch) 2025-05-07T20:29:06.4035091Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_nvtx_cu12-12.6.77-py3-none-manylinux2014_x86_64.whl (89 kB) 
2025-05-07T20:29:06.4035792Z Collecting nvidia-nvjitlink-cu12==12.6.85 (from torch) 2025-05-07T20:29:06.4036613Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_nvjitlink_cu12-12.6.85-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (19.7 MB) 2025-05-07T20:29:06.4037504Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19.7/19.7 MB 212.3 MB/s eta 0:00:00 2025-05-07T20:29:06.4038050Z Collecting nvidia-cufile-cu12==1.11.1.6 (from torch) 2025-05-07T20:29:06.4038871Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cufile_cu12-1.11.1.6-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.5 kB) 2025-05-07T20:29:06.4039711Z Collecting pytorch-triton==3.3.0+git96316ce5 (from torch) 2025-05-07T20:29:06.4040586Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.3.0%2Bgit96316ce5-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.6 kB) 2025-05-07T20:29:06.4041446Z Collecting mpmath<1.4,>=1.1.0 (from sympy>=1.13.3->torch) 2025-05-07T20:29:06.4042021Z Downloading https://download.pytorch.org/whl/nightly/mpmath-1.3.0-py3-none-any.whl (536 kB) 2025-05-07T20:29:06.4042668Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 536.2/536.2 kB 52.2 MB/s eta 0:00:00 2025-05-07T20:29:06.4043195Z Collecting MarkupSafe>=2.0 (from jinja2->torch) 2025-05-07T20:29:06.4043993Z Downloading https://download.pytorch.org/whl/nightly/MarkupSafe-2.1.5-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (28 kB) 2025-05-07T20:29:06.4045091Z Downloading https://download.pytorch.org/whl/nightly/cu126/torch-2.8.0.dev20250507%2Bcu126-cp312-cp312-manylinux_2_28_x86_64.whl (825.4 MB) 2025-05-07T20:29:06.4045919Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 825.4/825.4 MB 16.7 MB/s eta 0:00:00 2025-05-07T20:29:06.4046714Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_cufile_cu12-1.11.1.6-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (1.1 MB) 2025-05-07T20:29:06.4047593Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.1/1.1 MB 80.1 MB/s eta 0:00:00 2025-05-07T20:29:06.4048376Z Downloading https://download.pytorch.org/whl/nightly/cu126/nvidia_nccl_cu12-2.26.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (201.3 MB) 2025-05-07T20:29:06.4049254Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 201.3/201.3 MB 149.8 MB/s eta 0:00:00 2025-05-07T20:29:06.4050080Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.3.0%2Bgit96316ce5-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (153.5 MB) 2025-05-07T20:29:06.4051012Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 153.5/153.5 MB 135.1 MB/s eta 0:00:00 2025-05-07T20:29:06.4053012Z Installing collected packages: nvidia-cusparselt-cu12, mpmath, sympy, pytorch-triton, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufile-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, networkx, MarkupSafe, fsspec, filelock, nvidia-cusparse-cu12, nvidia-cufft-cu12, nvidia-cudnn-cu12, jinja2, nvidia-cusolver-cu12, torch 2025-05-07T20:29:06.4054963Z 2025-05-07T20:29:06.4057064Z Successfully installed MarkupSafe-2.1.5 filelock-3.16.1 fsspec-2024.10.0 jinja2-3.1.4 mpmath-1.3.0 networkx-3.4.2 nvidia-cublas-cu12-12.6.4.1 nvidia-cuda-cupti-cu12-12.6.80 nvidia-cuda-nvrtc-cu12-12.6.77 nvidia-cuda-runtime-cu12-12.6.77 nvidia-cudnn-cu12-9.5.1.17 nvidia-cufft-cu12-11.3.0.4 nvidia-cufile-cu12-1.11.1.6 nvidia-curand-cu12-10.3.7.77 nvidia-cusolver-cu12-11.7.1.2 nvidia-cusparse-cu12-12.5.4.2 
nvidia-cusparselt-cu12-0.6.3 nvidia-nccl-cu12-2.26.2 nvidia-nvjitlink-cu12-12.6.85 nvidia-nvtx-cu12-12.6.77 pytorch-triton-3.3.0+git96316ce5 sympy-1.13.3 torch-2.8.0.dev20250507+cu126 2025-05-07T20:29:06.4062207Z 2025-05-07T20:29:08.6435981Z torch 2.8.0.dev20250507+cu126 2025-05-07T20:29:08.6438484Z [CHECK] The installed package [torch, nightly/LATEST] is the correct variant (cu126) 2025-05-07T20:29:12.1665854Z [CHECK] Python (sub-)package 'torch.distributed' found ... 2025-05-07T20:29:15.6827355Z [CHECK] NOTE: The installed version is: 2.8.0.dev20250507+cu126 2025-05-07T20:29:15.6827968Z [CHECK] NOTE: Checking _GLIBCXX_USE_CXX11_ABI ... 2025-05-07T20:29:19.1167692Z True 2025-05-07T20:29:19.1167938Z True 2025-05-07T20:29:19.1168044Z 2025-05-07T20:29:19.1824714Z [INSTALL] Successfully installed PyTorch through PyTorch PIP 2025-05-07T20:29:19.1871749Z ##[group]Run if . $PRELUDE && which conda; then collect_pytorch_env_info $BUILD_ENV; fi 2025-05-07T20:29:19.1872360Z if . $PRELUDE && which conda; then collect_pytorch_env_info $BUILD_ENV; fi 2025-05-07T20:29:19.1886770Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:29:19.1887136Z env: 2025-05-07T20:29:19.1887373Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:29:19.1887685Z BUILD_ENV: build_binary 2025-05-07T20:29:19.1887944Z BUILD_TARGET: genai 2025-05-07T20:29:19.1888188Z BUILD_VARIANT: cuda 2025-05-07T20:29:19.1888430Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:29:19.1888723Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:29:19.1889082Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:29:19.1889428Z ##[endgroup] 2025-05-07T20:29:19.5287342Z /home/ec2-user/miniconda/bin/conda 2025-05-07T20:29:19.5289439Z ################################################################################ 2025-05-07T20:29:19.5289958Z # Collect PyTorch Environment Information (for Reporting Issues) 2025-05-07T20:29:19.5290351Z # 2025-05-07T20:29:19.5305224Z # [2025-05-07T20:29:19.530Z] + collect_pytorch_env_info build_binary 2025-05-07T20:29:19.5305645Z ################################################################################ 2025-05-07T20:29:19.5305876Z 2025-05-07T20:29:19.5322213Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:29:19.6259414Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:29:19.6270285Z [INFO] Downloading the PyTorch environment info collection script ... 2025-05-07T20:29:19.6271008Z + wget -q https://raw.githubusercontent.com/pytorch/pytorch/main/torch/utils/collect_env.py 2025-05-07T20:29:19.6271423Z 2025-05-07T20:29:19.7131240Z 2025-05-07T20:29:19.7132255Z [INFO] Collecting PyTorch environment info (will be needed for reporting issues to PyTorch) ... 2025-05-07T20:29:19.7156057Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary python collect_env.py 2025-05-07T20:29:25.7713902Z Collecting environment information... 
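
[NOTE] The environment report that follows was produced by the collect_env.py script downloaded above. With torch already installed, the same report can typically be generated from the copy bundled with the package, without downloading anything:

    # Print the PyTorch environment report (torch/CUDA versions, GPU, CPU,
    # and related library versions) for use when filing issues.
    conda run -n build_binary python -m torch.utils.collect_env
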
2025-05-07T20:29:25.7714361Z PyTorch version: 2.8.0.dev20250507+cu126 2025-05-07T20:29:25.7714655Z Is debug build: False 2025-05-07T20:29:25.7714905Z CUDA used to build PyTorch: 12.6 2025-05-07T20:29:25.7715189Z ROCM used to build PyTorch: N/A 2025-05-07T20:29:25.7715366Z 2025-05-07T20:29:25.7715468Z OS: Amazon Linux 2023.6.20250317 (x86_64) 2025-05-07T20:29:25.7715797Z GCC version: (conda-forge gcc 11.4.0-13) 11.4.0 2025-05-07T20:29:25.7716127Z Clang version: Could not collect 2025-05-07T20:29:25.7716398Z CMake version: Could not collect 2025-05-07T20:29:25.7716669Z Libc version: glibc-2.34 2025-05-07T20:29:25.7716831Z 2025-05-07T20:29:25.7717143Z Python version: 3.12.2 | packaged by conda-forge | (main, Feb 16 2024, 20:50:58) [GCC 12.3.0] (64-bit runtime) 2025-05-07T20:29:25.7717777Z Python platform: Linux-6.1.130-139.222.amzn2023.x86_64-x86_64-with-glibc2.34 2025-05-07T20:29:25.7718195Z Is CUDA available: True 2025-05-07T20:29:25.7718844Z CUDA runtime version: 12.6.85 2025-05-07T20:29:25.7719129Z CUDA_MODULE_LOADING set to: LAZY 2025-05-07T20:29:25.7719440Z GPU models and configuration: GPU 0: NVIDIA A10G 2025-05-07T20:29:25.7719784Z Nvidia driver version: 570.133.07 2025-05-07T20:29:25.7720073Z cuDNN version: Could not collect 2025-05-07T20:29:25.7720348Z HIP runtime version: N/A 2025-05-07T20:29:25.7720606Z MIOpen runtime version: N/A 2025-05-07T20:29:25.7720875Z Is XNNPACK available: True 2025-05-07T20:29:25.7721038Z 2025-05-07T20:29:25.7721124Z CPU: 2025-05-07T20:29:25.7721339Z Architecture: x86_64 2025-05-07T20:29:25.7721685Z CPU op-mode(s): 32-bit, 64-bit 2025-05-07T20:29:25.7722094Z Address sizes: 48 bits physical, 48 bits virtual 2025-05-07T20:29:25.7722489Z Byte Order: Little Endian 2025-05-07T20:29:25.7722824Z CPU(s): 16 2025-05-07T20:29:25.7723132Z On-line CPU(s) list: 0-15 2025-05-07T20:29:25.7723690Z Vendor ID: AuthenticAMD 2025-05-07T20:29:25.7724039Z Model name: AMD EPYC 7R32 2025-05-07T20:29:25.7724374Z CPU family: 23 2025-05-07T20:29:25.7724666Z Model: 49 2025-05-07T20:29:25.7724955Z Thread(s) per core: 2 2025-05-07T20:29:25.7725251Z Core(s) per socket: 8 2025-05-07T20:29:25.7725817Z Socket(s): 1 2025-05-07T20:29:25.7726099Z Stepping: 0 2025-05-07T20:29:25.7726402Z BogoMIPS: 5599.29 2025-05-07T20:29:25.7728612Z Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:29:25.7730833Z Hypervisor vendor: KVM 2025-05-07T20:29:25.7731152Z Virtualization type: full 2025-05-07T20:29:25.7731496Z L1d cache: 256 KiB (8 instances) 2025-05-07T20:29:25.7731876Z L1i cache: 256 KiB (8 instances) 2025-05-07T20:29:25.7732254Z L2 cache: 4 MiB (8 instances) 2025-05-07T20:29:25.7732620Z L3 cache: 32 MiB (2 instances) 2025-05-07T20:29:25.7732946Z NUMA node(s): 1 2025-05-07T20:29:25.7733250Z NUMA node0 CPU(s): 0-15 2025-05-07T20:29:25.7733592Z Vulnerability Gather data sampling: Not affected 2025-05-07T20:29:25.7733974Z Vulnerability Itlb multihit: Not affected 2025-05-07T20:29:25.7734351Z Vulnerability L1tf: Not affected 2025-05-07T20:29:25.7734822Z Vulnerability 
Mds: Not affected 2025-05-07T20:29:25.7735181Z Vulnerability Meltdown: Not affected 2025-05-07T20:29:25.7735551Z Vulnerability Mmio stale data: Not affected 2025-05-07T20:29:25.7735927Z Vulnerability Reg file data sampling: Not affected 2025-05-07T20:29:25.7736487Z Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection 2025-05-07T20:29:25.7737079Z Vulnerability Spec rstack overflow: Mitigation; safe RET 2025-05-07T20:29:25.7737636Z Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl 2025-05-07T20:29:25.7738345Z Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization 2025-05-07T20:29:25.7739229Z Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected 2025-05-07T20:29:25.7740083Z Vulnerability Srbds: Not affected 2025-05-07T20:29:25.7740462Z Vulnerability Tsx async abort: Not affected 2025-05-07T20:29:25.7740700Z 2025-05-07T20:29:25.7740812Z Versions of relevant libraries: 2025-05-07T20:29:25.7741080Z [pip3] numpy==2.2.5 2025-05-07T20:29:25.7741331Z [pip3] nvidia-cublas-cu12==12.6.4.1 2025-05-07T20:29:25.7741646Z [pip3] nvidia-cuda-cupti-cu12==12.6.80 2025-05-07T20:29:25.7741964Z [pip3] nvidia-cuda-nvrtc-cu12==12.6.77 2025-05-07T20:29:25.7742288Z [pip3] nvidia-cuda-runtime-cu12==12.6.77 2025-05-07T20:29:25.7742627Z [pip3] nvidia-cudnn-cu12==9.5.1.17 2025-05-07T20:29:25.7742927Z [pip3] nvidia-cufft-cu12==11.3.0.4 2025-05-07T20:29:25.7743222Z [pip3] nvidia-curand-cu12==10.3.7.77 2025-05-07T20:29:25.7743532Z [pip3] nvidia-cusolver-cu12==11.7.1.2 2025-05-07T20:29:25.7743851Z [pip3] nvidia-cusparse-cu12==12.5.4.2 2025-05-07T20:29:25.7745063Z [pip3] nvidia-cusparselt-cu12==0.6.3 2025-05-07T20:29:25.7745391Z [pip3] nvidia-nccl-cu12==2.26.2 2025-05-07T20:29:25.7745685Z [pip3] nvidia-nvjitlink-cu12==12.6.85 2025-05-07T20:29:25.7745992Z [pip3] nvidia-nvtx-cu12==12.6.77 2025-05-07T20:29:25.7746292Z [pip3] pytorch-triton==3.3.0+git96316ce5 2025-05-07T20:29:25.7746607Z [pip3] torch==2.8.0.dev20250507+cu126 2025-05-07T20:29:25.7746984Z [conda] cuda-cudart 12.6.77 h5888daf_0 conda-forge 2025-05-07T20:29:25.7747489Z [conda] cuda-cudart-dev 12.6.77 h5888daf_0 conda-forge 2025-05-07T20:29:25.7748026Z [conda] cuda-cudart-dev_linux-64 12.6.77 h3f2d84a_0 conda-forge 2025-05-07T20:29:25.7748573Z [conda] cuda-cudart-static 12.6.77 h5888daf_0 conda-forge 2025-05-07T20:29:25.7749143Z [conda] cuda-cudart-static_linux-64 12.6.77 h3f2d84a_0 conda-forge 2025-05-07T20:29:25.7749728Z [conda] cuda-cudart_linux-64 12.6.77 h3f2d84a_0 conda-forge 2025-05-07T20:29:25.7750243Z [conda] cuda-cupti 12.6.80 hbd13f7d_0 conda-forge 2025-05-07T20:29:25.7750720Z [conda] cuda-cupti-dev 12.6.80 h5888daf_0 conda-forge 2025-05-07T20:29:25.7751224Z [conda] cuda-libraries 12.6.3 ha770c72_0 conda-forge 2025-05-07T20:29:25.7751738Z [conda] cuda-libraries-dev 12.6.3 ha770c72_0 conda-forge 2025-05-07T20:29:25.7752233Z [conda] cuda-nvrtc 12.6.85 hbd13f7d_0 conda-forge 2025-05-07T20:29:25.7752706Z [conda] cuda-nvrtc-dev 12.6.85 h5888daf_0 conda-forge 2025-05-07T20:29:25.7753182Z [conda] cuda-nvtx 12.6.77 hbd13f7d_0 conda-forge 2025-05-07T20:29:25.7753652Z [conda] cuda-opencl 12.6.77 hbd13f7d_0 conda-forge 2025-05-07T20:29:25.7754145Z [conda] cuda-opencl-dev 12.6.77 h5888daf_0 conda-forge 2025-05-07T20:29:25.7754636Z [conda] cuda-runtime 12.6.3 ha804496_0 conda-forge 2025-05-07T20:29:25.7755114Z [conda] libcublas 12.6.4.1 h5888daf_1 conda-forge 
2025-05-07T20:29:25.7755597Z [conda] libcublas-dev 12.6.4.1 h5888daf_1 conda-forge 2025-05-07T20:29:25.7756066Z [conda] libcufft 11.3.0.4 hbd13f7d_0 conda-forge 2025-05-07T20:29:25.7756539Z [conda] libcufft-dev 11.3.0.4 h5888daf_0 conda-forge 2025-05-07T20:29:25.7757015Z [conda] libcurand 10.3.7.77 hbd13f7d_0 conda-forge 2025-05-07T20:29:25.7757501Z [conda] libcurand-dev 10.3.7.77 h5888daf_0 conda-forge 2025-05-07T20:29:25.7757983Z [conda] libcusolver 11.7.1.2 h5888daf_1 conda-forge 2025-05-07T20:29:25.7758475Z [conda] libcusolver-dev 11.7.1.2 h5888daf_1 conda-forge 2025-05-07T20:29:25.7758975Z [conda] libcusparse 12.5.4.2 hbd13f7d_0 conda-forge 2025-05-07T20:29:25.7759566Z [conda] libcusparse-dev 12.5.4.2 h5888daf_0 conda-forge 2025-05-07T20:29:25.7760066Z [conda] libnvjitlink 12.6.85 hbd13f7d_0 conda-forge 2025-05-07T20:29:25.7760567Z [conda] libnvjitlink-dev 12.6.85 h5888daf_0 conda-forge 2025-05-07T20:29:25.7761041Z [conda] numpy 2.2.5 py312h72c5963_0 conda-forge 2025-05-07T20:29:25.7761504Z [conda] nvidia-cublas-cu12 12.6.4.1 pypi_0 pypi 2025-05-07T20:29:25.7762023Z [conda] nvidia-cuda-cupti-cu12 12.6.80 pypi_0 pypi 2025-05-07T20:29:25.7762539Z [conda] nvidia-cuda-nvrtc-cu12 12.6.77 pypi_0 pypi 2025-05-07T20:29:25.7763055Z [conda] nvidia-cuda-runtime-cu12 12.6.77 pypi_0 pypi 2025-05-07T20:29:25.7763570Z [conda] nvidia-cudnn-cu12 9.5.1.17 pypi_0 pypi 2025-05-07T20:29:25.7764153Z [conda] nvidia-cufft-cu12 11.3.0.4 pypi_0 pypi 2025-05-07T20:29:25.7764655Z [conda] nvidia-curand-cu12 10.3.7.77 pypi_0 pypi 2025-05-07T20:29:25.7765153Z [conda] nvidia-cusolver-cu12 11.7.1.2 pypi_0 pypi 2025-05-07T20:29:25.7765667Z [conda] nvidia-cusparse-cu12 12.5.4.2 pypi_0 pypi 2025-05-07T20:29:25.7766184Z [conda] nvidia-cusparselt-cu12 0.6.3 pypi_0 pypi 2025-05-07T20:29:25.7766690Z [conda] nvidia-nccl-cu12 2.26.2 pypi_0 pypi 2025-05-07T20:29:25.7767189Z [conda] nvidia-nvjitlink-cu12 12.6.85 pypi_0 pypi 2025-05-07T20:29:25.7767689Z [conda] nvidia-nvtx-cu12 12.6.77 pypi_0 pypi 2025-05-07T20:29:25.7768181Z [conda] pytorch-triton 3.3.0+git96316ce5 pypi_0 pypi 2025-05-07T20:29:25.7768656Z [conda] torch 2.8.0.dev20250507+cu126 pypi_0 pypi 2025-05-07T20:29:25.7768945Z 2025-05-07T20:29:25.8502914Z ##[group]Run . $PRELUDE; cd fbgemm_gpu; prepare_fbgemm_gpu_build $BUILD_ENV 2025-05-07T20:29:25.8503591Z . 
$PRELUDE; cd fbgemm_gpu; prepare_fbgemm_gpu_build $BUILD_ENV 2025-05-07T20:29:25.8516287Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:29:25.8516645Z env: 2025-05-07T20:29:25.8516872Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:29:25.8517172Z BUILD_ENV: build_binary 2025-05-07T20:29:25.8517425Z BUILD_TARGET: genai 2025-05-07T20:29:25.8517659Z BUILD_VARIANT: cuda 2025-05-07T20:29:25.8517891Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:29:25.8518141Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:29:25.8518445Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:29:25.8518780Z ##[endgroup] 2025-05-07T20:29:26.1929420Z ################################################################################ 2025-05-07T20:29:26.1929784Z # Prepare FBGEMM-GPU Build 2025-05-07T20:29:26.1930047Z # 2025-05-07T20:29:26.1945096Z # [2025-05-07T20:29:26.194Z] + prepare_fbgemm_gpu_build build_binary 2025-05-07T20:29:26.1945531Z ################################################################################ 2025-05-07T20:29:26.1945756Z 2025-05-07T20:29:26.1960802Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:29:26.2836302Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:29:26.2858957Z [BUILD] Running git submodules update ... 2025-05-07T20:29:26.2881333Z [EXEC] [ATTEMPT 0/3] + git submodule sync 2025-05-07T20:29:26.3247072Z Synchronizing submodule url for '../external/asmjit' 2025-05-07T20:29:26.3247738Z Synchronizing submodule url for '../external/composable_kernel' 2025-05-07T20:29:26.3248295Z Synchronizing submodule url for '../external/cpuinfo' 2025-05-07T20:29:26.3248706Z Synchronizing submodule url for '../external/cutlass' 2025-05-07T20:29:26.3249543Z Synchronizing submodule url for '../external/googletest' 2025-05-07T20:29:26.3250461Z Synchronizing submodule url for '../external/hipify_torch' 2025-05-07T20:29:26.3251939Z Synchronizing submodule url for '../external/json' 2025-05-07T20:29:26.3285814Z [EXEC] [ATTEMPT 0/3] + git submodule update --init --recursive 2025-05-07T20:29:26.3839886Z [BUILD] Installing other build dependencies ... 
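
[NOTE] The dependency install that follows pulls FBGEMM's Python build requirements. A standalone equivalent of the retried command, assuming the working directory is fbgemm_gpu as in this job, is:

    # Install build-time dependencies (cmake, ninja, scikit-build, etc.)
    # listed in requirements.txt into the build environment.
    conda run --no-capture-output -n build_binary \
        python -m pip install -r requirements.txt
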
2025-05-07T20:29:26.3860717Z [EXEC] [ATTEMPT 0/3] + conda run --no-capture-output -n build_binary python -m pip install -r requirements.txt 2025-05-07T20:29:28.8227272Z Collecting backports.tarfile (from -r requirements.txt (line 13)) 2025-05-07T20:29:28.8409107Z Downloading backports.tarfile-1.2.0-py3-none-any.whl.metadata (2.0 kB) 2025-05-07T20:29:28.9558825Z Collecting build (from -r requirements.txt (line 14)) 2025-05-07T20:29:28.9590259Z Downloading build-1.2.2.post1-py3-none-any.whl.metadata (6.5 kB) 2025-05-07T20:29:29.1764478Z Collecting cmake (from -r requirements.txt (line 15)) 2025-05-07T20:29:29.1795734Z Downloading cmake-4.0.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.3 kB) 2025-05-07T20:29:29.3078434Z Collecting click (from -r requirements.txt (line 16)) 2025-05-07T20:29:29.3110468Z Downloading click-8.1.8-py3-none-any.whl.metadata (2.3 kB) 2025-05-07T20:29:29.6443770Z Collecting hypothesis (from -r requirements.txt (line 17)) 2025-05-07T20:29:29.6481607Z Downloading hypothesis-6.131.14-py3-none-any.whl.metadata (5.6 kB) 2025-05-07T20:29:29.7048921Z Requirement already satisfied: jinja2 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from -r requirements.txt (line 18)) (3.1.4) 2025-05-07T20:29:29.7062335Z Requirement already satisfied: mpmath==1.3.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from -r requirements.txt (line 19)) (1.3.0) 2025-05-07T20:29:29.7882306Z Collecting ninja (from -r requirements.txt (line 20)) 2025-05-07T20:29:29.7916841Z Downloading ninja-1.11.1.4-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.metadata (5.0 kB) 2025-05-07T20:29:29.8410328Z Requirement already satisfied: numpy>=2.0.2 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from -r requirements.txt (line 21)) (2.2.5) 2025-05-07T20:29:29.8918635Z Collecting pyre-extensions (from -r requirements.txt (line 22)) 2025-05-07T20:29:29.8968322Z Downloading pyre_extensions-0.0.32-py3-none-any.whl.metadata (4.0 kB) 2025-05-07T20:29:29.9967885Z Collecting pyyaml (from -r requirements.txt (line 23)) 2025-05-07T20:29:29.9999260Z Downloading PyYAML-6.0.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.1 kB) 2025-05-07T20:29:30.0595691Z Collecting scikit-build (from -r requirements.txt (line 24)) 2025-05-07T20:29:30.0645584Z Downloading scikit_build-0.18.1-py3-none-any.whl.metadata (18 kB) 2025-05-07T20:29:30.1024230Z Requirement already satisfied: setuptools in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from -r requirements.txt (line 25)) (78.1.1) 2025-05-07T20:29:30.1543787Z Collecting setuptools_git_versioning (from -r requirements.txt (line 26)) 2025-05-07T20:29:30.1593506Z Downloading setuptools_git_versioning-2.1.0-py3-none-any.whl.metadata (6.1 kB) 2025-05-07T20:29:30.2341279Z Collecting tabulate (from -r requirements.txt (line 27)) 2025-05-07T20:29:30.2368880Z Downloading tabulate-0.9.0-py3-none-any.whl.metadata (34 kB) 2025-05-07T20:29:30.3345544Z Collecting patchelf (from -r requirements.txt (line 28)) 2025-05-07T20:29:30.3388984Z Downloading patchelf-0.17.2.2-py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.musllinux_1_1_x86_64.whl.metadata (3.5 kB) 2025-05-07T20:29:30.4205339Z Collecting packaging>=19.1 (from build->-r requirements.txt (line 14)) 2025-05-07T20:29:30.4237544Z Downloading packaging-25.0-py3-none-any.whl.metadata (3.3 kB) 2025-05-07T20:29:30.4817996Z Collecting pyproject_hooks (from build->-r requirements.txt 
(line 14)) 2025-05-07T20:29:30.4845582Z Downloading pyproject_hooks-1.2.0-py3-none-any.whl.metadata (1.3 kB) 2025-05-07T20:29:30.5587807Z Collecting attrs>=22.2.0 (from hypothesis->-r requirements.txt (line 17)) 2025-05-07T20:29:30.5614809Z Downloading attrs-25.3.0-py3-none-any.whl.metadata (10 kB) 2025-05-07T20:29:30.6429170Z Collecting sortedcontainers<3.0.0,>=2.1.0 (from hypothesis->-r requirements.txt (line 17)) 2025-05-07T20:29:30.6460768Z Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl.metadata (10 kB) 2025-05-07T20:29:30.6898395Z Requirement already satisfied: MarkupSafe>=2.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from jinja2->-r requirements.txt (line 18)) (2.1.5) 2025-05-07T20:29:30.7276123Z Collecting typing-inspect (from pyre-extensions->-r requirements.txt (line 22)) 2025-05-07T20:29:30.7303982Z Downloading typing_inspect-0.9.0-py3-none-any.whl.metadata (1.5 kB) 2025-05-07T20:29:30.7697566Z Requirement already satisfied: typing-extensions in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from pyre-extensions->-r requirements.txt (line 22)) (4.13.2) 2025-05-07T20:29:30.8331972Z Collecting distro (from scikit-build->-r requirements.txt (line 24)) 2025-05-07T20:29:30.8359701Z Downloading distro-1.9.0-py3-none-any.whl.metadata (6.8 kB) 2025-05-07T20:29:30.8837588Z Requirement already satisfied: wheel>=0.32.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from scikit-build->-r requirements.txt (line 24)) (0.45.1) 2025-05-07T20:29:30.9520898Z Collecting mypy-extensions>=0.3.0 (from typing-inspect->pyre-extensions->-r requirements.txt (line 22)) 2025-05-07T20:29:30.9549032Z Downloading mypy_extensions-1.1.0-py3-none-any.whl.metadata (1.1 kB) 2025-05-07T20:29:31.0062888Z Downloading backports.tarfile-1.2.0-py3-none-any.whl (30 kB) 2025-05-07T20:29:31.0679424Z Downloading build-1.2.2.post1-py3-none-any.whl (22 kB) 2025-05-07T20:29:31.1287444Z Downloading cmake-4.0.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.9 MB) 2025-05-07T20:29:31.8138086Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 27.9/27.9 MB 40.9 MB/s eta 0:00:00 2025-05-07T20:29:31.8170201Z Downloading click-8.1.8-py3-none-any.whl (98 kB) 2025-05-07T20:29:31.8666036Z Downloading hypothesis-6.131.14-py3-none-any.whl (500 kB) 2025-05-07T20:29:31.9199044Z Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl (29 kB) 2025-05-07T20:29:31.9647748Z Downloading ninja-1.11.1.4-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (422 kB) 2025-05-07T20:29:32.0276661Z Downloading pyre_extensions-0.0.32-py3-none-any.whl (12 kB) 2025-05-07T20:29:32.0781737Z Downloading PyYAML-6.0.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (767 kB) 2025-05-07T20:29:32.1429245Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 767.5/767.5 kB 8.2 MB/s eta 0:00:00 2025-05-07T20:29:32.1479667Z Downloading scikit_build-0.18.1-py3-none-any.whl (85 kB) 2025-05-07T20:29:32.1982548Z Downloading setuptools_git_versioning-2.1.0-py3-none-any.whl (10 kB) 2025-05-07T20:29:32.2482030Z Downloading tabulate-0.9.0-py3-none-any.whl (35 kB) 2025-05-07T20:29:32.2966515Z Downloading patchelf-0.17.2.2-py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.musllinux_1_1_x86_64.whl (466 kB) 2025-05-07T20:29:32.3561869Z Downloading attrs-25.3.0-py3-none-any.whl (63 kB) 2025-05-07T20:29:32.4071258Z Downloading packaging-25.0-py3-none-any.whl (66 kB) 2025-05-07T20:29:32.4561822Z Downloading distro-1.9.0-py3-none-any.whl (20 kB) 2025-05-07T20:29:32.5044089Z Downloading 
pyproject_hooks-1.2.0-py3-none-any.whl (10 kB) 2025-05-07T20:29:32.5544422Z Downloading typing_inspect-0.9.0-py3-none-any.whl (8.8 kB) 2025-05-07T20:29:32.6029806Z Downloading mypy_extensions-1.1.0-py3-none-any.whl (5.0 kB) 2025-05-07T20:29:32.7704454Z Installing collected packages: sortedcontainers, tabulate, pyyaml, pyproject_hooks, patchelf, packaging, ninja, mypy-extensions, distro, cmake, click, backports.tarfile, attrs, typing-inspect, setuptools_git_versioning, scikit-build, hypothesis, build, pyre-extensions 2025-05-07T20:29:35.0440491Z 2025-05-07T20:29:35.0486403Z Successfully installed attrs-25.3.0 backports.tarfile-1.2.0 build-1.2.2.post1 click-8.1.8 cmake-4.0.0 distro-1.9.0 hypothesis-6.131.14 mypy-extensions-1.1.0 ninja-1.11.1.4 packaging-25.0 patchelf-0.17.2.2 pyproject_hooks-1.2.0 pyre-extensions-0.0.32 pyyaml-6.0.2 scikit-build-0.18.1 setuptools_git_versioning-2.1.0 sortedcontainers-2.4.0 tabulate-0.9.0 typing-inspect-0.9.0 2025-05-07T20:29:35.2316330Z ################################################################################ 2025-05-07T20:29:35.2316731Z # Install PyTorch (PyTorch PIP) 2025-05-07T20:29:35.2317003Z # 2025-05-07T20:29:35.2334936Z # [2025-05-07T20:29:35.233Z] + install_triton_pip build_binary 2025-05-07T20:29:35.2335340Z ################################################################################ 2025-05-07T20:29:35.2335561Z 2025-05-07T20:29:35.2335797Z [BUILD] Installing pytorch-triton nightly/3.2.0+git4b3bb1f8 from PIP ... 2025-05-07T20:29:35.2336249Z ################################################################################ 2025-05-07T20:29:35.2336630Z # Install Package From PyTorch PIP: pytorch-triton 2025-05-07T20:29:35.2336966Z # 2025-05-07T20:29:35.2352029Z # [2025-05-07T20:29:35.234Z] + install_from_pytorch_pip build_binary pytorch-triton nightly/3.2.0+git4b3bb1f8 2025-05-07T20:29:35.2352579Z ################################################################################ 2025-05-07T20:29:35.2352808Z 2025-05-07T20:29:35.2372259Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:29:35.3239872Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:29:35.3240325Z ################################################################################ 2025-05-07T20:29:35.3240673Z # Prepare PIP Arguments (PyTorch PIP) 2025-05-07T20:29:35.3240970Z # 2025-05-07T20:29:35.3260206Z # [2025-05-07T20:29:35.325Z] + __prepare_pip_arguments pytorch-triton nightly/3.2.0+git4b3bb1f8 2025-05-07T20:29:35.3260714Z ################################################################################ 2025-05-07T20:29:35.3260939Z 2025-05-07T20:29:35.3309254Z [INSTALL] Extracted package (channel, version): (nightly, 3.2.0+git4b3bb1f8) 2025-05-07T20:29:35.3325704Z [INSTALL] Using a non-RELEASE channel: nightly ... 2025-05-07T20:29:35.3326264Z [INSTALL] Extracted the full PIP channel: https://download.pytorch.org/whl/nightly/ 2025-05-07T20:29:35.3334929Z [INSTALL] Extracted the full PIP package: --pre pytorch-triton==3.2.0+git4b3bb1f8 2025-05-07T20:29:35.3361570Z [INSTALL] Attempting to install [pytorch-triton, 3.2.0+git4b3bb1f8] from PyTorch PIP using channel https://download.pytorch.org/whl/nightly/ ... 2025-05-07T20:29:35.3382761Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --pre pytorch-triton==3.2.0+git4b3bb1f8 --index-url https://download.pytorch.org/whl/nightly/ 2025-05-07T20:29:42.8809239Z ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. 
This behaviour is the source of the following dependency conflicts. 2025-05-07T20:29:42.8810927Z torch 2.8.0.dev20250507+cu126 requires pytorch-triton==3.3.0+git96316ce5; platform_system == "Linux" and platform_machine == "x86_64", but you have pytorch-triton 3.2.0+git4b3bb1f8 which is incompatible. 2025-05-07T20:29:42.8811724Z 2025-05-07T20:29:42.8811953Z Looking in indexes: https://download.pytorch.org/whl/nightly/ 2025-05-07T20:29:42.8812386Z Collecting pytorch-triton==3.2.0+git4b3bb1f8 2025-05-07T20:29:42.8813222Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.2.0%2Bgit4b3bb1f8-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.3 kB) 2025-05-07T20:29:42.8814500Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.2.0%2Bgit4b3bb1f8-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (166.5 MB) 2025-05-07T20:29:42.8815790Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 166.5/166.5 MB 59.2 MB/s eta 0:00:00 2025-05-07T20:29:42.8816186Z Installing collected packages: pytorch-triton 2025-05-07T20:29:42.8816539Z Attempting uninstall: pytorch-triton 2025-05-07T20:29:42.8816936Z Found existing installation: pytorch-triton 3.3.0+git96316ce5 2025-05-07T20:29:42.8817372Z Uninstalling pytorch-triton-3.3.0+git96316ce5: 2025-05-07T20:29:42.8818870Z Successfully uninstalled pytorch-triton-3.3.0+git96316ce5 2025-05-07T20:29:42.8819324Z Successfully installed pytorch-triton-3.2.0+git4b3bb1f8 2025-05-07T20:29:42.8819604Z 2025-05-07T20:29:45.1334457Z [CHECK] Python (sub-)package 'triton' found ... 2025-05-07T20:29:45.1338343Z [CHECK] Printing out the pytorch-triton version ... 2025-05-07T20:29:47.3080231Z ################################################################################ 2025-05-07T20:29:47.3080708Z [CHECK] The installed VERSION of pytorch-triton is: 3.2.0 2025-05-07T20:29:47.3081105Z ################################################################################ 2025-05-07T20:29:47.3081330Z 2025-05-07T20:29:49.3990388Z [CHECK] Python (sub-)package 'numpy' found ... 2025-05-07T20:29:51.5928999Z [CHECK] Python (sub-)package 'skbuild' found ... 2025-05-07T20:29:51.5933199Z [BUILD] Successfully ran git submodules update 2025-05-07T20:29:51.5970662Z ##[group]Run . $PRELUDE; install_fbgemm_gpu_wheel $BUILD_ENV *.whl 2025-05-07T20:29:51.5971187Z . 
$PRELUDE; install_fbgemm_gpu_wheel $BUILD_ENV *.whl 2025-05-07T20:29:51.5983926Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:29:51.5984281Z env: 2025-05-07T20:29:51.5984512Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:29:51.5984812Z BUILD_ENV: build_binary 2025-05-07T20:29:51.5985058Z BUILD_TARGET: genai 2025-05-07T20:29:51.5985291Z BUILD_VARIANT: cuda 2025-05-07T20:29:51.5985527Z BUILD_CUDA_VERSION: 12.6.3 2025-05-07T20:29:51.5985778Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:29:51.5986078Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:29:51.5986415Z ##[endgroup] 2025-05-07T20:29:51.9377840Z ################################################################################ 2025-05-07T20:29:51.9378246Z # Install FBGEMM-GPU from Wheel 2025-05-07T20:29:51.9378525Z # 2025-05-07T20:29:51.9394312Z # [2025-05-07T20:29:51.939Z] + install_fbgemm_gpu_wheel build_binary fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:29:51.9394987Z ################################################################################ 2025-05-07T20:29:51.9395213Z 2025-05-07T20:29:51.9395590Z [INSTALL] Printing out FBGEMM-GPU wheel SHA: fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:29:51.9396362Z + sha1sum fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:29:51.9396717Z 2025-05-07T20:29:51.9515750Z b58dd3e4c726c265422746de0dfe912f1de4c20c fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:29:51.9518593Z 2025-05-07T20:29:51.9519148Z + sha256sum fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:29:51.9519530Z 2025-05-07T20:29:51.9649502Z e43258215d51ee2f91c736eb424ad291b450bb2c2463b8d99c2ae36a64a4ffa7 fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:29:51.9650804Z 2025-05-07T20:29:51.9651151Z + md5sum fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:29:51.9651530Z 2025-05-07T20:29:51.9885860Z 616cc1b2508efed22f2eda95309a712f fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:29:51.9887674Z 2025-05-07T20:29:51.9897480Z [INSTALL] Installing FBGEMM-GPU wheel: fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl ... 2025-05-07T20:29:51.9919341Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary python -m pip install fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:29:54.6861201Z Processing ./fbgemm_gpu_genai_nightly-2025.5.7-cp312-cp312-manylinux_2_28_x86_64.whl 2025-05-07T20:29:54.6862200Z Requirement already satisfied: numpy in /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages (from fbgemm-gpu-genai-nightly==2025.5.7) (2.2.5) 2025-05-07T20:29:54.6863103Z Installing collected packages: fbgemm-gpu-genai-nightly 2025-05-07T20:29:54.6863559Z Successfully installed fbgemm-gpu-genai-nightly-2025.5.7 2025-05-07T20:29:54.6864179Z 2025-05-07T20:30:01.7024820Z ################################################################################ 2025-05-07T20:30:01.7025228Z [CHECK] !!!! INFO !!!! 
2025-05-07T20:30:01.7024820Z ################################################################################
2025-05-07T20:30:01.7025228Z [CHECK] !!!! INFO !!!!
2025-05-07T20:30:01.7025856Z [CHECK] The installed version of PyTorch is: 2.8.0.dev20250507+cu126
2025-05-07T20:30:01.7026301Z [CHECK] CUDA version reported by PyTorch is: 12.6
2025-05-07T20:30:01.7026635Z [CHECK]
2025-05-07T20:30:01.7026979Z [CHECK] NOTE: If the PyTorch package channel is different from the FBGEMM_GPU
2025-05-07T20:30:01.7027492Z [CHECK] package channel, the package may be broken at runtime!!!
2025-05-07T20:30:01.7027896Z ################################################################################
2025-05-07T20:30:01.7028239Z [INSTALL] Checking imports and symbols ...
2025-05-07T20:30:05.7583207Z [CHECK] Python (sub-)package 'fbgemm_gpu' found ...
2025-05-07T20:30:09.7974550Z [CHECK] Found symbol '__version__' in Python package 'fbgemm_gpu'.
2025-05-07T20:30:13.8271569Z [CHECK] Found symbol '__variant__' in Python package 'fbgemm_gpu'.
2025-05-07T20:30:13.8276346Z [CHECK] Printing out the FBGEMM-GPU version ...
2025-05-07T20:30:25.9001684Z ################################################################################
2025-05-07T20:30:25.9005093Z [CHECK] The installed FBGEMM TARGET is: genai
2025-05-07T20:30:25.9005480Z [CHECK] The installed FBGEMM VARIANT is: cuda
2025-05-07T20:30:25.9005845Z [CHECK] The installed FBGEMM VERSION is: 2025.5.7
2025-05-07T20:30:25.9006204Z ################################################################################
2025-05-07T20:30:33.9774976Z ################################################################################
2025-05-07T20:30:33.9775716Z [CHECK] FBGEMM_GPU Experimental Packages
2025-05-07T20:30:33.9777593Z [CHECK] fbgemm_gpu: ['__annotations__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__target__', '__variant__', '__version__', '_load_library', 'docs', 'fbgemm_genai_libraries', 'fbgemm_gpu', 'fbgemm_gpu_libraries', 'libraries_to_load', 'library', 'logging', 'open_source', 'os', 'split_embedding_configs', 'split_table_batched_embeddings_ops_common', 'torch', 'utils']
2025-05-07T20:30:33.9779229Z [CHECK] fbgemm_gpu.experimental: ['__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__']
2025-05-07T20:30:33.9779775Z ################################################################################
2025-05-07T20:30:33.9780170Z [INSTALL] Check for installation of Python sources ...
2025-05-07T20:30:38.0356945Z [CHECK] Python (sub-)package 'fbgemm_gpu.config' found ...
2025-05-07T20:30:42.0902587Z [CHECK] Python (sub-)package 'fbgemm_gpu.docs' found ...
2025-05-07T20:30:46.2409266Z [CHECK] Python (sub-)package 'fbgemm_gpu.quantize' found ...
2025-05-07T20:30:50.2797485Z [CHECK] Python (sub-)package 'fbgemm_gpu.tbe.cache' found ...
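A minimal sketch of the import-and-symbol probe these [CHECK] lines perform; the attribute names come from the dir() listing printed above (the exact probe in setup_env.bash is bash, not shown in this log):

    import fbgemm_gpu

    # Attribute names taken from the dir() listing above.
    for symbol in ("__version__", "__variant__", "__target__"):
        value = getattr(fbgemm_gpu, symbol, None)
        print(f"fbgemm_gpu.{symbol} = {value!r}")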
2025-05-07T20:30:50.2802521Z [INSTALL] Check for operator registrations ...
2025-05-07T20:30:54.2612897Z fbgemm.nccl_init
2025-05-07T20:30:54.3252338Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.nccl_init
2025-05-07T20:30:58.3139930Z fbgemm.gqa_attn_splitk
2025-05-07T20:30:58.3823991Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.gqa_attn_splitk
2025-05-07T20:31:02.3342479Z fbgemm.rope_qkv_decoding
2025-05-07T20:31:02.3996843Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.rope_qkv_decoding
2025-05-07T20:31:02.3997687Z [INSTALL] FBGEMM-GPU installation through wheel completed ...
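A sketch of an equivalent registration check from Python, assuming only that importing fbgemm_gpu loads the extension libraries that register the operators (operator names are the ones the [CHECK] lines report; an unregistered name raises AttributeError on lookup):

    import torch
    import fbgemm_gpu  # noqa: F401 -- importing loads the extension libraries

    # Looking up a name that was never registered raises AttributeError,
    # which serves as the failure signal here.
    for op_name in ("nccl_init", "gqa_attn_splitk", "rope_qkv_decoding"):
        getattr(torch.ops.fbgemm, op_name)
        print(f"registered: torch.ops.fbgemm.{op_name}")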
2025-05-07T20:31:02.4033134Z ##[group]Run . $PRELUDE; test_all_fbgemm_gpu_modules $BUILD_ENV
2025-05-07T20:31:02.4033603Z . $PRELUDE; test_all_fbgemm_gpu_modules $BUILD_ENV
2025-05-07T20:31:02.4046896Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:31:02.4047471Z env:
2025-05-07T20:31:02.4047704Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:31:02.4048019Z BUILD_ENV: build_binary
2025-05-07T20:31:02.4048277Z BUILD_TARGET: genai
2025-05-07T20:31:02.4048519Z BUILD_VARIANT: cuda
2025-05-07T20:31:02.4048761Z BUILD_CUDA_VERSION: 12.6.3
2025-05-07T20:31:02.4049037Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:31:02.4049350Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:31:02.4049691Z ##[endgroup]
2025-05-07T20:31:02.7463126Z ################################################################################
2025-05-07T20:31:02.7463676Z # Test All FBGEMM-GPU Modules
2025-05-07T20:31:02.7463946Z #
2025-05-07T20:31:02.7471492Z # [2025-05-07T20:31:02.746Z] + test_all_fbgemm_gpu_modules build_binary
2025-05-07T20:31:02.7471941Z ################################################################################
2025-05-07T20:31:10.8133719Z [TEST] Determined FBGEMM_GPU (target : variant) from installation: (genai : cuda)
2025-05-07T20:31:10.8134737Z [TEST] Will be running tests specific to this target and variant ...
2025-05-07T20:31:10.8135285Z [TEST] Determined the test directories:
2025-05-07T20:31:10.8135614Z fbgemm_gpu/experimental/gen_ai/test
2025-05-07T20:31:10.8135924Z fbgemm_gpu/experimental/example/test
2025-05-07T20:31:10.8136236Z fbgemm_gpu/experimental/gemm/test
2025-05-07T20:31:10.8144147Z [TEST] FBGEMM_GPU variant is cuda; configuring for CUDA-based testing ...
2025-05-07T20:31:10.8151377Z [TEST] Set environment variables for CUDA testing ...
2025-05-07T20:31:10.8152020Z + conda env config vars unset -n build_binary CUDA_VISIBLE_DEVICES
2025-05-07T20:31:11.2465000Z [TEST] Installing PyTest ...
2025-05-07T20:31:11.2488273Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y pytest expecttest
2025-05-07T20:31:12.3506392Z Channels:
2025-05-07T20:31:12.3506774Z - conda-forge
2025-05-07T20:31:12.3507087Z Platform: linux-64
2025-05-07T20:31:15.7437143Z Collecting package metadata (repodata.json): done
2025-05-07T20:31:16.9076739Z Solving environment: done
2025-05-07T20:31:17.1356992Z ## Package Plan ##
2025-05-07T20:31:17.1357402Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:31:17.1357829Z added / updated specs:
2025-05-07T20:31:17.1358102Z - expecttest
2025-05-07T20:31:17.1358332Z - pytest
2025-05-07T20:31:17.1358609Z The following packages will be downloaded:
2025-05-07T20:31:17.1358986Z     package                    |            build
2025-05-07T20:31:17.1359336Z     ---------------------------|-----------------
2025-05-07T20:31:17.1359735Z     colorama-0.4.6             |   pyhd8ed1ab_1          26 KB  conda-forge
2025-05-07T20:31:17.1360239Z     exceptiongroup-1.2.2       |   pyhd8ed1ab_1          20 KB  conda-forge
2025-05-07T20:31:17.1360729Z     expecttest-0.3.0           |   pyhd8ed1ab_0          14 KB  conda-forge
2025-05-07T20:31:17.1361185Z     iniconfig-2.0.0            |   pyhd8ed1ab_1          11 KB  conda-forge
2025-05-07T20:31:17.1361644Z     packaging-25.0             |   pyh29332c3_1          61 KB  conda-forge
2025-05-07T20:31:17.1362093Z     pluggy-1.5.0               |   pyhd8ed1ab_1          23 KB  conda-forge
2025-05-07T20:31:17.1363939Z     pytest-8.3.5               |   pyhd8ed1ab_0         254 KB  conda-forge
2025-05-07T20:31:17.1364658Z     tomli-2.2.1                |   pyhd8ed1ab_1          19 KB  conda-forge
2025-05-07T20:31:17.1365075Z     ------------------------------------------------------------
2025-05-07T20:31:17.1365445Z                                            Total:         428 KB
2025-05-07T20:31:17.1365803Z The following NEW packages will be INSTALLED:
2025-05-07T20:31:17.1366404Z colorama           conda-forge/noarch::colorama-0.4.6-pyhd8ed1ab_1
2025-05-07T20:31:17.1366947Z exceptiongroup     conda-forge/noarch::exceptiongroup-1.2.2-pyhd8ed1ab_1
2025-05-07T20:31:17.1367497Z expecttest         conda-forge/noarch::expecttest-0.3.0-pyhd8ed1ab_0
2025-05-07T20:31:17.1367996Z iniconfig          conda-forge/noarch::iniconfig-2.0.0-pyhd8ed1ab_1
2025-05-07T20:31:17.1368540Z packaging          conda-forge/noarch::packaging-25.0-pyh29332c3_1
2025-05-07T20:31:17.1369020Z pluggy             conda-forge/noarch::pluggy-1.5.0-pyhd8ed1ab_1
2025-05-07T20:31:17.1369486Z pytest             conda-forge/noarch::pytest-8.3.5-pyhd8ed1ab_0
2025-05-07T20:31:17.1369930Z tomli              conda-forge/noarch::tomli-2.2.1-pyhd8ed1ab_1
2025-05-07T20:31:17.1370369Z Downloading and Extracting Packages: ...working... done
2025-05-07T20:31:17.4241452Z colorama-0.4.6, exceptiongroup-1.2.2, expecttest-0.3.0, iniconfig-2.0.0, packaging-25.0, pluggy-1.5.0, pytest-8.3.5, tomli-2.2.1 | ########## | 100%
2025-05-07T20:31:17.5245571Z Preparing transaction: done
2025-05-07T20:31:17.6248247Z Verifying transaction: done
2025-05-07T20:31:19.5277382Z Executing transaction: done
2025-05-07T20:31:19.6701329Z [TEST] Checking imports ...
2025-05-07T20:31:23.6781046Z [CHECK] Python (sub-)package 'fbgemm_gpu' found ...
2025-05-07T20:31:23.6794026Z [TEST] Setting feature flags ...
2025-05-07T20:31:23.6794661Z + conda env config vars set -n build_binary FBGEMM_TBE_ENSEMBLE_ROWWISE_ADAGRAD=1
2025-05-07T20:31:24.1079994Z [TEST] PyTest args: -v -rsx -s -W ignore::pytest.PytestCollectionWarning
2025-05-07T20:31:24.1080821Z ################################################################################
2025-05-07T20:31:24.1081268Z # Run FBGEMM-GPU Tests:
2025-05-07T20:31:24.1081601Z #
2025-05-07T20:31:24.1100861Z # [2025-05-07T20:31:24.109Z] + __run_fbgemm_gpu_tests_in_directory build_binary
2025-05-07T20:31:24.1101464Z ################################################################################
2025-05-07T20:31:24.1108754Z [TEST] Enumerating ALL test files ...
2025-05-07T20:31:24.1138792Z ./attention/gqa_test.py
2025-05-07T20:31:24.1139175Z ./coalesce/coalesce_test.py
2025-05-07T20:31:24.1139557Z ./comm/multi_gpu_car_test.py
2025-05-07T20:31:24.1139933Z ./gather_scatter/gather_scatter_test.py
2025-05-07T20:31:24.1140304Z ./kv_cache/kv_cache_test.py
2025-05-07T20:31:24.1140566Z ./moe/activation_test.py
2025-05-07T20:31:24.1140813Z ./moe/gather_scatter_test.py
2025-05-07T20:31:24.1141072Z ./moe/layers_test.py
2025-05-07T20:31:24.1141308Z ./moe/shuffling_test.py
2025-05-07T20:31:24.1141551Z ./quantize/quantize_test.py
2025-05-07T20:31:24.1141846Z [TEST] Enumerating IGNORED test files ...
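Each suite below is launched with the flags from the "[TEST] PyTest args:" line plus --cache-clear. A minimal sketch of the same invocation through pytest's Python entry point (the runner script itself drives this through conda run and bash):

    import sys
    import pytest

    # Flags copied from the "[TEST] PyTest args:" line and the command echoed below.
    args = [
        "-v", "-rsx", "-s",
        "-W", "ignore::pytest.PytestCollectionWarning",
        "--cache-clear",
        "./attention/gqa_test.py",
    ]
    sys.exit(pytest.main(args))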
2025-05-07T20:31:24.1159608Z ################################################################################
2025-05-07T20:31:24.1174811Z # [2025-05-07T20:31:24.117Z] Run Python Test Suite:
2025-05-07T20:31:24.1175272Z # ./attention/gqa_test.py
2025-05-07T20:31:24.1175659Z ################################################################################
2025-05-07T20:31:24.1198919Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./attention/gqa_test.py
2025-05-07T20:31:26.6611728Z ============================= test session starts ==============================
2025-05-07T20:31:26.6612604Z platform linux -- Python 3.12.2, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python
2025-05-07T20:31:26.6613211Z cachedir: .pytest_cache
2025-05-07T20:31:26.6613808Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)
2025-05-07T20:31:26.6614899Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu
2025-05-07T20:31:26.6615335Z plugins: hypothesis-6.131.14
2025-05-07T20:31:28.3345804Z collecting ... collected 2 items
2025-05-07T20:32:05.7728120Z attention/gqa_test.py::Int4GQATest::test_gqa Trying example: test_gqa(int4_kv=False, num_groups=1, B=1, MAX_T=4, N_H_L=1)
2025-05-07T20:32:05.7731930Z Trying example: test_gqa(int4_kv=True, num_groups=1, B=1, MAX_T=4, N_H_L=1)
(Hypothesis went on to try several dozen more combinations of int4_kv, num_groups, B, MAX_T, and N_H_L)
2025-05-07T20:32:05.7821573Z PASSED
2025-05-07T20:32:05.7931567Z attention/gqa_test.py::Int4GQATest::test_mqa_main SKIPPED (Skip when...)
2025-05-07T20:32:05.7932096Z =========================== short test summary info ============================
2025-05-07T20:32:05.7932853Z SKIPPED [1] ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/unittest/case.py:154: Skip when CUDA is not available or xformers is not available
2025-05-07T20:32:05.7933598Z ======================== 1 passed, 1 skipped in 39.63s =========================
2025-05-07T20:32:06.4650172Z [TEST] Python test suite PASSED: ./attention/gqa_test.py
2025-05-07T20:32:06.4670134Z [TEST] Python test time for ./attention/gqa_test.py: 42 seconds
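The session headers above report a Hypothesis profile named 'ci'. A sketch of registering that profile with the exact values the header prints (where the real codebase registers it is not shown in this log):

    from hypothesis import HealthCheck, settings

    # Values copied from the "hypothesis profile 'ci'" line in the session header.
    settings.register_profile(
        "ci",
        database=None,
        deadline=None,
        print_blob=True,
        derandomize=True,
        suppress_health_check=[HealthCheck.too_slow],
    )
    settings.load_profile("ci")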
2025-05-07T20:32:06.4692908Z ################################################################################
2025-05-07T20:32:06.4709110Z # [2025-05-07T20:32:06.470Z] Run Python Test Suite:
2025-05-07T20:32:06.4709453Z # ./coalesce/coalesce_test.py
2025-05-07T20:32:06.4709755Z ################################################################################
2025-05-07T20:32:06.4733994Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./coalesce/coalesce_test.py
2025-05-07T20:32:08.6441358Z ============================= test session starts ==============================
2025-05-07T20:32:08.6442070Z platform linux -- Python 3.12.2, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python
2025-05-07T20:32:08.6442633Z cachedir: .pytest_cache
2025-05-07T20:32:08.6443314Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)
2025-05-07T20:32:08.6444193Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu
2025-05-07T20:32:08.6444671Z plugins: hypothesis-6.131.14
2025-05-07T20:32:10.3877384Z collecting ... collected 1 item
2025-05-07T20:32:11.1683453Z coalesce/coalesce_test.py::CoalesceTest::test_coalesce_batches PASSED
2025-05-07T20:32:11.1683974Z ============================== 1 passed in 2.65s ===============================
2025-05-07T20:32:11.8202103Z [TEST] Python test suite PASSED: ./coalesce/coalesce_test.py
2025-05-07T20:32:11.8222678Z [TEST] Python test time for ./coalesce/coalesce_test.py: 5 seconds
2025-05-07T20:32:11.8245730Z ################################################################################
2025-05-07T20:32:11.8261098Z # [2025-05-07T20:32:11.825Z] Run Python Test Suite:
2025-05-07T20:32:11.8261462Z # ./comm/multi_gpu_car_test.py
2025-05-07T20:32:11.8261759Z ################################################################################
2025-05-07T20:32:11.8287505Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./comm/multi_gpu_car_test.py
2025-05-07T20:32:14.0201812Z ============================= test session starts ==============================
2025-05-07T20:32:14.0202878Z platform linux -- Python 3.12.2, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python
2025-05-07T20:32:14.0203799Z cachedir: .pytest_cache
2025-05-07T20:32:14.0204782Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)
2025-05-07T20:32:14.0206061Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu
2025-05-07T20:32:14.0206802Z plugins: hypothesis-6.131.14
2025-05-07T20:32:15.7376329Z collecting ... collected 5 items
2025-05-07T20:32:15.7390566Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allgather SKIPPED
2025-05-07T20:32:15.7400248Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allgather_dtype_mismatch SKIPPED
2025-05-07T20:32:15.7408955Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allreduce SKIPPED
2025-05-07T20:32:15.7417716Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_oneshot_car_stress SKIPPED
2025-05-07T20:32:15.7437919Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_reducescatter SKIPPED
2025-05-07T20:32:15.7438776Z =========================== short test summary info ============================
2025-05-07T20:32:15.7439506Z SKIPPED [1] comm/multi_gpu_car_test.py:310: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs
2025-05-07T20:32:15.7440499Z SKIPPED [1] comm/multi_gpu_car_test.py:351: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs
2025-05-07T20:32:15.7441671Z SKIPPED [1] comm/multi_gpu_car_test.py:418: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs
2025-05-07T20:32:15.7442659Z SKIPPED [1] comm/multi_gpu_car_test.py:434: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs
2025-05-07T20:32:15.7443644Z SKIPPED [1] comm/multi_gpu_car_test.py:402: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs
2025-05-07T20:32:15.7444338Z ============================== 5 skipped in 1.85s ==============================
2025-05-07T20:32:16.3229082Z [TEST] Python test suite PASSED: ./comm/multi_gpu_car_test.py
2025-05-07T20:32:16.3248371Z [TEST] Python test time for ./comm/multi_gpu_car_test.py: 5 seconds
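All five tests skip because this g5.4xlarge runner has a single GPU. A sketch of the kind of two-device gate behind those skip messages, using the skip text from the summary above (the decorator placement is illustrative, not the test file's exact code):

    import unittest
    import torch

    @unittest.skipIf(
        not torch.cuda.is_available() or torch.cuda.device_count() < 2,
        "Skip when CUDA is not available or when there are not enough GPUs; "
        "these tests require at least two GPUs",
    )
    class CarLikeMultiGpuTest(unittest.TestCase):
        def test_smoke(self) -> None:
            # Only reached on hosts with two or more visible CUDA devices.
            self.assertGreaterEqual(torch.cuda.device_count(), 2)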
2025-05-07T20:32:16.3271132Z ################################################################################
2025-05-07T20:32:16.3286732Z # [2025-05-07T20:32:16.328Z] Run Python Test Suite:
2025-05-07T20:32:16.3287236Z # ./gather_scatter/gather_scatter_test.py
2025-05-07T20:32:16.3287675Z ################################################################################
2025-05-07T20:32:16.3311419Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./gather_scatter/gather_scatter_test.py
2025-05-07T20:32:18.5011851Z ============================= test session starts ==============================
2025-05-07T20:32:18.5012589Z platform linux -- Python 3.12.2, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python
2025-05-07T20:32:18.5013139Z cachedir: .pytest_cache
2025-05-07T20:32:18.5013754Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)
2025-05-07T20:32:18.5014598Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu
2025-05-07T20:32:18.5015018Z plugins: hypothesis-6.131.14
2025-05-07T20:32:20.3092863Z collecting ... collected 2 items
2025-05-07T20:32:20.3104464Z gather_scatter/gather_scatter_test.py::GatherScatterTests::test_gather_along_first_dim SKIPPED
2025-05-07T20:32:20.3121497Z gather_scatter/gather_scatter_test.py::GatherScatterTests::test_scatter_add_along_first_dim SKIPPED
2025-05-07T20:32:20.3122330Z =========================== short test summary info ============================
2025-05-07T20:32:20.3122983Z SKIPPED [1] gather_scatter/gather_scatter_test.py:29: Skip when no Hopper GPU is available. This test is only for Hopper GPU.
2025-05-07T20:32:20.3123854Z SKIPPED [1] gather_scatter/gather_scatter_test.py:70: Skip when no Hopper GPU is available. This test is only for Hopper GPU.
2025-05-07T20:32:20.3124488Z ============================== 2 skipped in 1.93s ==============================
2025-05-07T20:32:20.9176254Z [TEST] Python test suite PASSED: ./gather_scatter/gather_scatter_test.py
2025-05-07T20:32:20.9198292Z [TEST] Python test time for ./gather_scatter/gather_scatter_test.py: 4 seconds
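A sketch of a Hopper-only gate like the one behind these skips. The predicate is an assumption about how such a check is typically written (Hopper/H100 reports compute capability (9, 0)), not the test file's actual code:

    import unittest
    import torch

    def has_hopper_gpu() -> bool:
        # Hopper (H100) parts report compute capability (9, 0).
        if not torch.cuda.is_available():
            return False
        major, _minor = torch.cuda.get_device_capability()
        return major == 9

    @unittest.skipUnless(
        has_hopper_gpu(),
        "Skip when no Hopper GPU is available. This test is only for Hopper GPU.",
    )
    class GatherScatterLikeTest(unittest.TestCase):
        def test_runs_on_hopper_only(self) -> None:
            self.assertTrue(has_hopper_gpu())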
2025-05-07T20:32:20.9221598Z ################################################################################
2025-05-07T20:32:20.9237989Z # [2025-05-07T20:32:20.923Z] Run Python Test Suite:
2025-05-07T20:32:20.9238914Z # ./kv_cache/kv_cache_test.py
2025-05-07T20:32:20.9239291Z ################################################################################
2025-05-07T20:32:20.9263739Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./kv_cache/kv_cache_test.py
2025-05-07T20:32:23.0935959Z ============================= test session starts ==============================
2025-05-07T20:32:23.0936712Z platform linux -- Python 3.12.2, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python
2025-05-07T20:32:23.0937244Z cachedir: .pytest_cache
2025-05-07T20:32:23.0937838Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)
2025-05-07T20:32:23.0938594Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu
2025-05-07T20:32:23.0939017Z plugins: hypothesis-6.131.14
2025-05-07T20:32:24.7942484Z collecting ... collected 4 items
2025-05-07T20:32:27.6271323Z kv_cache/kv_cache_test.py::KVCacheTests::test_fp8_kv_cache SKIPPED (...)
2025-05-07T20:32:27.6356825Z kv_cache/kv_cache_test.py::KVCacheTests::test_int4_kv_cache SKIPPED
2025-05-07T20:32:27.6452311Z kv_cache/kv_cache_test.py::KVCacheTests::test_positional_encoding_with_paged_attention SKIPPED
2025-05-07T20:32:27.6542058Z kv_cache/kv_cache_test.py::KVCacheTests::test_rope_positional_encoding_only SKIPPED
2025-05-07T20:32:27.6542610Z =========================== short test summary info ============================
2025-05-07T20:32:27.6543353Z SKIPPED [1] ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/unittest/case.py:154: Skip when H100 is not available or MI300 is not available
2025-05-07T20:32:27.6544317Z SKIPPED [3] ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/unittest/case.py:154: Skip when xformers is not available
2025-05-07T20:32:27.6544987Z ============================== 4 skipped in 4.68s ==============================
2025-05-07T20:32:29.6340606Z [TEST] Python test suite PASSED: ./kv_cache/kv_cache_test.py
2025-05-07T20:32:29.6362224Z [TEST] Python test time for ./kv_cache/kv_cache_test.py: 9 seconds
2025-05-07T20:32:29.6386153Z ################################################################################
2025-05-07T20:32:29.6401891Z # [2025-05-07T20:32:29.639Z] Run Python Test Suite:
2025-05-07T20:32:29.6402231Z # ./moe/activation_test.py
2025-05-07T20:32:29.6402710Z ################################################################################
2025-05-07T20:32:29.6427669Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./moe/activation_test.py
2025-05-07T20:32:31.8245248Z ============================= test session starts ==============================
2025-05-07T20:32:31.8245893Z platform linux -- Python 3.12.2, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python
2025-05-07T20:32:31.8246432Z cachedir: .pytest_cache
2025-05-07T20:32:31.8247034Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)
2025-05-07T20:32:31.8247800Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu
2025-05-07T20:32:31.8248222Z plugins: hypothesis-6.131.14
2025-05-07T20:32:33.5044203Z TMA benchmarks will be running with experimental grid constant TMA descriptor.
collected 2 items 2025-05-07T20:32:33.6129957Z 2025-05-07T20:32:39.1199672Z moe/activation_test.py::ActivationTests::test_silu_mul Trying example: test_silu_mul( 2025-05-07T20:32:39.1201046Z self=, 2025-05-07T20:32:39.1201965Z T=1, 2025-05-07T20:32:39.1202160Z D=5120, 2025-05-07T20:32:39.1202364Z contiguous=True, 2025-05-07T20:32:39.1202601Z compiled=True, 2025-05-07T20:32:39.1202804Z ) 2025-05-07T20:32:39.1203006Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1203594Z self=, 2025-05-07T20:32:39.1203984Z T=4096, 2025-05-07T20:32:39.1204185Z D=5120, 2025-05-07T20:32:39.1204384Z contiguous=True, 2025-05-07T20:32:39.1204607Z compiled=True, 2025-05-07T20:32:39.1204817Z ) 2025-05-07T20:32:39.1205017Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1205396Z self=, 2025-05-07T20:32:39.1205799Z T=4096, 2025-05-07T20:32:39.1205994Z D=7168, 2025-05-07T20:32:39.1206187Z contiguous=False, 2025-05-07T20:32:39.1206416Z compiled=False, 2025-05-07T20:32:39.1206627Z ) 2025-05-07T20:32:39.1206815Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1207206Z self=, 2025-05-07T20:32:39.1207605Z T=4096, 2025-05-07T20:32:39.1207796Z D=5120, 2025-05-07T20:32:39.1207989Z contiguous=False, 2025-05-07T20:32:39.1208221Z compiled=True, 2025-05-07T20:32:39.1208430Z ) 2025-05-07T20:32:39.1208624Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1209005Z self=, 2025-05-07T20:32:39.1209395Z T=1, 2025-05-07T20:32:39.1209575Z D=7168, 2025-05-07T20:32:39.1209777Z contiguous=True, 2025-05-07T20:32:39.1210009Z compiled=True, 2025-05-07T20:32:39.1210207Z ) 2025-05-07T20:32:39.1210408Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1210791Z self=, 2025-05-07T20:32:39.1211181Z T=1, 2025-05-07T20:32:39.1211374Z D=7168, 2025-05-07T20:32:39.1211580Z contiguous=False, 2025-05-07T20:32:39.1211803Z compiled=True, 2025-05-07T20:32:39.1212024Z ) 2025-05-07T20:32:39.1212234Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1212613Z self=, 2025-05-07T20:32:39.1213011Z T=4096, 2025-05-07T20:32:39.1213214Z D=5120, 2025-05-07T20:32:39.1213416Z contiguous=False, 2025-05-07T20:32:39.1213653Z compiled=False, 2025-05-07T20:32:39.1213872Z ) 2025-05-07T20:32:39.1214072Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1214597Z self=, 2025-05-07T20:32:39.1215007Z T=1, 2025-05-07T20:32:39.1215212Z D=7168, 2025-05-07T20:32:39.1215414Z contiguous=True, 2025-05-07T20:32:39.1215660Z compiled=False, 2025-05-07T20:32:39.1215884Z ) 2025-05-07T20:32:39.1216089Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1216483Z self=, 2025-05-07T20:32:39.1216892Z T=2048, 2025-05-07T20:32:39.1217091Z D=5120, 2025-05-07T20:32:39.1217308Z contiguous=True, 2025-05-07T20:32:39.1217547Z compiled=True, 2025-05-07T20:32:39.1217750Z ) 2025-05-07T20:32:39.1217961Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1218349Z self=, 2025-05-07T20:32:39.1218748Z T=2048, 2025-05-07T20:32:39.1218944Z D=7168, 2025-05-07T20:32:39.1219143Z contiguous=True, 2025-05-07T20:32:39.1219364Z compiled=True, 2025-05-07T20:32:39.1219574Z ) 2025-05-07T20:32:39.1219781Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1220167Z self=, 2025-05-07T20:32:39.1220573Z T=2048, 2025-05-07T20:32:39.1220774Z D=7168, 2025-05-07T20:32:39.1220983Z contiguous=True, 2025-05-07T20:32:39.1221216Z compiled=False, 2025-05-07T20:32:39.1221435Z ) 2025-05-07T20:32:39.1221649Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1222142Z self=, 2025-05-07T20:32:39.1222646Z T=128, 2025-05-07T20:32:39.1222847Z D=5120, 2025-05-07T20:32:39.1223045Z contiguous=False, 2025-05-07T20:32:39.1223282Z 
compiled=True, 2025-05-07T20:32:39.1223503Z ) 2025-05-07T20:32:39.1223701Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1224184Z self=, 2025-05-07T20:32:39.1224596Z T=128, 2025-05-07T20:32:39.1224786Z D=5120, 2025-05-07T20:32:39.1225000Z contiguous=True, 2025-05-07T20:32:39.1225240Z compiled=True, 2025-05-07T20:32:39.1225639Z ) 2025-05-07T20:32:39.1225860Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1226262Z self=, 2025-05-07T20:32:39.1226659Z T=16384, 2025-05-07T20:32:39.1226872Z D=5120, 2025-05-07T20:32:39.1227087Z contiguous=False, 2025-05-07T20:32:39.1227328Z compiled=True, 2025-05-07T20:32:39.1227537Z ) 2025-05-07T20:32:39.1227748Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1228164Z self=, 2025-05-07T20:32:39.1228572Z T=16384, 2025-05-07T20:32:39.1228784Z D=5120, 2025-05-07T20:32:39.1228987Z contiguous=False, 2025-05-07T20:32:39.1229228Z compiled=False, 2025-05-07T20:32:39.1229515Z ) 2025-05-07T20:32:39.1229781Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1230278Z self=, 2025-05-07T20:32:39.1230724Z T=128, 2025-05-07T20:32:39.1231177Z D=7168, 2025-05-07T20:32:39.1241419Z contiguous=True, 2025-05-07T20:32:39.1241706Z compiled=False, 2025-05-07T20:32:39.1241926Z ) 2025-05-07T20:32:39.1242150Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1242557Z self=, 2025-05-07T20:32:39.1242971Z T=128, 2025-05-07T20:32:39.1243195Z D=7168, 2025-05-07T20:32:39.1243417Z contiguous=False, 2025-05-07T20:32:39.1243657Z compiled=False, 2025-05-07T20:32:39.1243906Z ) 2025-05-07T20:32:39.1244134Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1244524Z self=, 2025-05-07T20:32:39.1244945Z T=1, 2025-05-07T20:32:39.1245153Z D=5120, 2025-05-07T20:32:39.1245367Z contiguous=False, 2025-05-07T20:32:39.1245617Z compiled=False, 2025-05-07T20:32:39.1245848Z ) 2025-05-07T20:32:39.1246053Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1246455Z self=, 2025-05-07T20:32:39.1246869Z T=1, 2025-05-07T20:32:39.1247080Z D=7168, 2025-05-07T20:32:39.1247288Z contiguous=False, 2025-05-07T20:32:39.1247539Z compiled=False, 2025-05-07T20:32:39.1247768Z ) 2025-05-07T20:32:39.1247974Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1248386Z self=, 2025-05-07T20:32:39.1248795Z T=4096, 2025-05-07T20:32:39.1248992Z D=5120, 2025-05-07T20:32:39.1249215Z contiguous=True, 2025-05-07T20:32:39.1249456Z compiled=False, 2025-05-07T20:32:39.1249665Z ) 2025-05-07T20:32:39.1249874Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1250270Z self=, 2025-05-07T20:32:39.1250672Z T=128, 2025-05-07T20:32:39.1250872Z D=7168, 2025-05-07T20:32:39.1251083Z contiguous=True, 2025-05-07T20:32:39.1251307Z compiled=True, 2025-05-07T20:32:39.1251524Z ) 2025-05-07T20:32:39.1251733Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1252173Z self=, 2025-05-07T20:32:39.1252583Z T=1, 2025-05-07T20:32:39.1252777Z D=5120, 2025-05-07T20:32:39.1252977Z contiguous=False, 2025-05-07T20:32:39.1253213Z compiled=True, 2025-05-07T20:32:39.1253429Z ) 2025-05-07T20:32:39.1253632Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1254020Z self=, 2025-05-07T20:32:39.1254699Z T=4096, 2025-05-07T20:32:39.1254905Z D=7168, 2025-05-07T20:32:39.1255108Z contiguous=True, 2025-05-07T20:32:39.1255343Z compiled=False, 2025-05-07T20:32:39.1255564Z ) 2025-05-07T20:32:39.1255768Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1256286Z self=, 2025-05-07T20:32:39.1256686Z T=4096, 2025-05-07T20:32:39.1256881Z D=7168, 2025-05-07T20:32:39.1257088Z contiguous=False, 2025-05-07T20:32:39.1257326Z compiled=True, 2025-05-07T20:32:39.1257535Z ) 
2025-05-07T20:32:39.1257745Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1258137Z self=, 2025-05-07T20:32:39.1258533Z T=128, 2025-05-07T20:32:39.1258739Z D=5120, 2025-05-07T20:32:39.1258951Z contiguous=True, 2025-05-07T20:32:39.1259181Z compiled=False, 2025-05-07T20:32:39.1259404Z ) 2025-05-07T20:32:39.1259616Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1260019Z self=, 2025-05-07T20:32:39.1260412Z T=128, 2025-05-07T20:32:39.1260613Z D=5120, 2025-05-07T20:32:39.1260852Z contiguous=False, 2025-05-07T20:32:39.1261105Z compiled=False, 2025-05-07T20:32:39.1261331Z ) 2025-05-07T20:32:39.1261539Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1261959Z self=, 2025-05-07T20:32:39.1262356Z T=1, 2025-05-07T20:32:39.1262553Z D=5120, 2025-05-07T20:32:39.1262767Z contiguous=True, 2025-05-07T20:32:39.1262998Z compiled=False, 2025-05-07T20:32:39.1263220Z ) 2025-05-07T20:32:39.1263429Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1263816Z self=, 2025-05-07T20:32:39.1264221Z T=2048, 2025-05-07T20:32:39.1264419Z D=7168, 2025-05-07T20:32:39.1264625Z contiguous=False, 2025-05-07T20:32:39.1264860Z compiled=True, 2025-05-07T20:32:39.1265081Z ) 2025-05-07T20:32:39.1265279Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1265671Z self=, 2025-05-07T20:32:39.1266070Z T=2048, 2025-05-07T20:32:39.1266266Z D=7168, 2025-05-07T20:32:39.1266482Z contiguous=False, 2025-05-07T20:32:39.1266720Z compiled=False, 2025-05-07T20:32:39.1266934Z ) 2025-05-07T20:32:39.1267141Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1267534Z self=, 2025-05-07T20:32:39.1267936Z T=16384, 2025-05-07T20:32:39.1268147Z D=7168, 2025-05-07T20:32:39.1268365Z contiguous=False, 2025-05-07T20:32:39.1268613Z compiled=True, 2025-05-07T20:32:39.1268828Z ) 2025-05-07T20:32:39.1269052Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1269452Z self=, 2025-05-07T20:32:39.1269857Z T=16384, 2025-05-07T20:32:39.1270071Z D=7168, 2025-05-07T20:32:39.1270293Z contiguous=True, 2025-05-07T20:32:39.1270524Z compiled=True, 2025-05-07T20:32:39.1270758Z ) 2025-05-07T20:32:39.1270979Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1271364Z self=, 2025-05-07T20:32:39.1271789Z T=4096, 2025-05-07T20:32:39.1271999Z D=7168, 2025-05-07T20:32:39.1272200Z contiguous=True, 2025-05-07T20:32:39.1272448Z compiled=True, 2025-05-07T20:32:39.1272674Z ) 2025-05-07T20:32:39.1272877Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1273276Z self=, 2025-05-07T20:32:39.1273694Z T=2048, 2025-05-07T20:32:39.1273905Z D=5120, 2025-05-07T20:32:39.1274109Z contiguous=False, 2025-05-07T20:32:39.1274361Z compiled=False, 2025-05-07T20:32:39.1274580Z ) 2025-05-07T20:32:39.1274782Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1275268Z self=, 2025-05-07T20:32:39.1275669Z T=2048, 2025-05-07T20:32:39.1275862Z D=5120, 2025-05-07T20:32:39.1276069Z contiguous=True, 2025-05-07T20:32:39.1276301Z compiled=False, 2025-05-07T20:32:39.1276513Z ) 2025-05-07T20:32:39.1276719Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1277193Z self=, 2025-05-07T20:32:39.1277582Z T=128, 2025-05-07T20:32:39.1277785Z D=7168, 2025-05-07T20:32:39.1277990Z contiguous=False, 2025-05-07T20:32:39.1278216Z compiled=True, 2025-05-07T20:32:39.1278435Z ) 2025-05-07T20:32:39.1278639Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1279021Z self=, 2025-05-07T20:32:39.1279422Z T=16384, 2025-05-07T20:32:39.1279634Z D=5120, 2025-05-07T20:32:39.1279833Z contiguous=True, 2025-05-07T20:32:39.1280063Z compiled=True, 2025-05-07T20:32:39.1280283Z ) 2025-05-07T20:32:39.1280490Z Trying example: 
test_silu_mul( 2025-05-07T20:32:39.1280891Z self=, 2025-05-07T20:32:39.1281309Z T=2048, 2025-05-07T20:32:39.1281517Z D=5120, 2025-05-07T20:32:39.1281725Z contiguous=False, 2025-05-07T20:32:39.1281981Z compiled=True, 2025-05-07T20:32:39.1282212Z ) 2025-05-07T20:32:39.1282420Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1282824Z self=, 2025-05-07T20:32:39.1283233Z T=16384, 2025-05-07T20:32:39.1283437Z D=5120, 2025-05-07T20:32:39.1283657Z contiguous=True, 2025-05-07T20:32:39.1283902Z compiled=False, 2025-05-07T20:32:39.1284115Z ) 2025-05-07T20:32:39.1284332Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1284734Z self=, 2025-05-07T20:32:39.1285135Z T=16384, 2025-05-07T20:32:39.1285354Z D=7168, 2025-05-07T20:32:39.1285577Z contiguous=False, 2025-05-07T20:32:39.1285818Z compiled=False, 2025-05-07T20:32:39.1286053Z ) 2025-05-07T20:32:39.1286277Z Trying example: test_silu_mul( 2025-05-07T20:32:39.1286663Z self=, 2025-05-07T20:32:39.1287074Z T=16384, 2025-05-07T20:32:39.1287296Z D=7168, 2025-05-07T20:32:39.1287510Z contiguous=True, 2025-05-07T20:32:39.1287735Z compiled=False, 2025-05-07T20:32:39.1287952Z ) 2025-05-07T20:32:39.1288148Z PASSED 2025-05-07T20:32:39.1889125Z W0507 20:32:39.186000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:39.1890416Z W0507 20:32:39.186000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Traceback (most recent call last): 2025-05-07T20:32:39.1893288Z W0507 20:32:39.186000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:39.1896520Z W0507 20:32:39.186000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:39.1898573Z W0507 20:32:39.186000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:39.1901016Z W0507 20:32:39.186000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:39.1902476Z W0507 20:32:39.186000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:39.1903685Z W0507 20:32:39.186000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:39.1904991Z W0507 20:32:39.186000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:39.1906581Z W0507 20:32:39.186000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:39.1907712Z W0507 20:32:39.186000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 
2025-05-07T20:32:39.1889125Z W0507 20:32:39.186000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
Traceback (most recent call last):
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
    ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
    ttir_module = src.make_ir(options, codegen_fns, module_map, context)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir
    return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
    generator.visit(fn.parse())
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit
    ret = super().visit(node)
          ^^^^^^^^^^^^^^^^^^^
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit
    return visitor(node)
           ^^^^^^^^^^^^^
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
    ast.NodeVisitor.generic_visit(self, node)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit
    self.visit(item)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit
    raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
triton.compiler.errors.CompilationError: at 1:0:
def _fbgemm_silu_mul_quant(
^
ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:39.2051211Z W0507 20:32:39.203000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:32:39.2445819Z W0507 20:32:39.243000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:32:39.2488511Z W0507 20:32:39.247000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:32:39.6753776Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)
self =
T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1295394540>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
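The reference path that fails here is triton_quantize_fp8_row: row-wise dynamic quantization to FP8, returning a per-row scale such that y ≈ y_fp8 * scale[:, None] (exactly how the test dequantizes). A rough pure-PyTorch sketch of those semantics, under stated assumptions — the scale_ub handling and the epsilon are guesses, and torch.float8_e4m3fn needs a recent PyTorch:

    # Rough sketch of row-wise FP8 quantization semantics (assumed, not
    # FBGEMM's kernel): one scale per row so that y ~= y_fp8 * scale[:, None].
    from typing import Optional, Tuple
    import torch

    def quantize_fp8_row_sketch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3
        row_max = y.abs().amax(dim=1).to(torch.float32)
        if scale_ub is not None:
            # Assumption: scale_ub caps the per-row max to bound the scale.
            row_max = torch.minimum(row_max, scale_ub)
        scale = row_max.clamp(min=1e-12) / fp8_max
        y_fp8 = (y.to(torch.float32) / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

The test dequantizes with y_fp8.to(torch.float32) * y_scale[:, None], which is why the scale itself (not its reciprocal) is returned.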
2025-05-07T20:32:39.6805350Z Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:39.9817677Z W0507 20:32:39.977000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:39.9836687Z W0507 20:32:39.977000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ret = super().visit(node) 2025-05-07T20:32:39.9837558Z W0507 20:32:39.977000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:39.9838642Z W0507 20:32:39.977000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:39.9839715Z W0507 20:32:39.977000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return visitor(node) 2025-05-07T20:32:39.9840562Z W0507 20:32:39.977000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^^^^^^^^^^^^^ 2025-05-07T20:32:39.9841856Z W0507 20:32:39.977000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:39.9843224Z W0507 20:32:39.977000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:39.9844400Z W0507 20:32:39.977000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:39.9845510Z W0507 20:32:39.977000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] self.visit(item) 2025-05-07T20:32:39.9846771Z W0507 20:32:39.977000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:39.9848211Z W0507 20:32:39.977000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:39.9849343Z W0507 20:32:39.977000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:39.9850298Z W0507 20:32:39.977000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:39.9851133Z W0507 20:32:39.977000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^ 2025-05-07T20:32:39.9852295Z W0507 20:32:39.977000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.0636048Z W0507 20:32:40.060000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:40.0637593Z W0507 20:32:40.060000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Traceback (most recent call last): 2025-05-07T20:32:40.0639001Z W0507 20:32:40.060000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:40.0640505Z W0507 20:32:40.060000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:40.0641547Z W0507 20:32:40.060000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:40.0642929Z W0507 20:32:40.060000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:40.0644399Z W0507 20:32:40.060000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.0645439Z W0507 20:32:40.060000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:40.0646734Z W0507 20:32:40.060000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:40.0648195Z W0507 20:32:40.060000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.0649319Z W0507 20:32:40.060000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:40.0650678Z W0507 20:32:40.060000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:40.0652005Z W0507 20:32:40.060000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] generator.visit(fn.parse()) 2025-05-07T20:32:40.0653295Z W0507 20:32:40.060000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:40.0654692Z W0507 20:32:40.060000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ret = super().visit(node) 2025-05-07T20:32:40.0655576Z W0507 20:32:40.060000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:40.0656658Z W0507 20:32:40.060000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:40.0657725Z W0507 20:32:40.060000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return visitor(node) 2025-05-07T20:32:40.0658570Z W0507 20:32:40.060000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^^^^^^^^^^^^^ 2025-05-07T20:32:40.0659984Z W0507 20:32:40.060000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:40.0661349Z W0507 20:32:40.060000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:40.0662644Z W0507 20:32:40.060000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:40.0663739Z W0507 20:32:40.060000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] self.visit(item) 2025-05-07T20:32:40.0664990Z W0507 20:32:40.060000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:40.0666424Z W0507 20:32:40.060000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:40.0667549Z W0507 20:32:40.060000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.0668515Z W0507 20:32:40.060000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.0669290Z W0507 20:32:40.060000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^ 2025-05-07T20:32:40.0670368Z W0507 20:32:40.060000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.2991988Z W0507 20:32:40.296000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:40.2993289Z W0507 20:32:40.296000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Traceback (most recent call last): 2025-05-07T20:32:40.2994733Z W0507 20:32:40.296000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:40.2996240Z W0507 20:32:40.296000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:40.2997270Z W0507 20:32:40.296000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:40.2998650Z W0507 20:32:40.296000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:40.3000129Z W0507 20:32:40.296000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.3001169Z W0507 20:32:40.296000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:40.3002476Z W0507 20:32:40.296000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:40.3004115Z W0507 20:32:40.296000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.3005243Z W0507 20:32:40.296000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:40.3006711Z W0507 20:32:40.296000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:40.3008039Z W0507 20:32:40.296000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] generator.visit(fn.parse()) 2025-05-07T20:32:40.3009339Z W0507 20:32:40.296000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:40.3010661Z W0507 20:32:40.296000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ret = super().visit(node) 2025-05-07T20:32:40.3011540Z W0507 20:32:40.296000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:40.3012630Z W0507 20:32:40.296000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:40.3013710Z W0507 20:32:40.296000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return visitor(node) 2025-05-07T20:32:40.3014667Z W0507 20:32:40.296000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^^^^^^^^^^^^^ 2025-05-07T20:32:40.3015956Z W0507 20:32:40.296000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:40.3017306Z W0507 20:32:40.296000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:40.3018488Z W0507 20:32:40.296000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:40.3019590Z W0507 20:32:40.296000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] self.visit(item) 2025-05-07T20:32:40.3020833Z W0507 20:32:40.296000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:40.3022274Z W0507 20:32:40.296000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:40.3023385Z W0507 20:32:40.296000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.3024347Z W0507 20:32:40.296000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.3025133Z W0507 20:32:40.296000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^ 2025-05-07T20:32:40.3026385Z W0507 20:32:40.296000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.3095511Z W0507 20:32:40.306000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:40.3097102Z W0507 20:32:40.306000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Traceback (most recent call last): 2025-05-07T20:32:40.3098529Z W0507 20:32:40.306000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:40.3100136Z W0507 20:32:40.306000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:40.3101164Z W0507 20:32:40.306000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:40.3102553Z W0507 20:32:40.306000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:40.3104008Z W0507 20:32:40.306000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.3105052Z W0507 20:32:40.306000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:40.3106351Z W0507 20:32:40.306000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:40.3107807Z W0507 20:32:40.306000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.3108931Z W0507 20:32:40.306000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:40.3110278Z W0507 20:32:40.306000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:40.3111609Z W0507 20:32:40.306000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] generator.visit(fn.parse()) 2025-05-07T20:32:40.3112900Z W0507 20:32:40.306000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:40.3114181Z W0507 20:32:40.306000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ret = super().visit(node) 2025-05-07T20:32:40.3115057Z W0507 20:32:40.306000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:40.3116132Z W0507 20:32:40.306000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:40.3117214Z W0507 20:32:40.306000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return visitor(node) 2025-05-07T20:32:40.3118055Z W0507 20:32:40.306000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^^^^^^^^^^^^^ 2025-05-07T20:32:40.3119336Z W0507 20:32:40.306000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:40.3120767Z W0507 20:32:40.306000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:40.3121948Z W0507 20:32:40.306000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:40.3123125Z W0507 20:32:40.306000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] self.visit(item) 2025-05-07T20:32:40.3124374Z W0507 20:32:40.306000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:40.3125973Z W0507 20:32:40.306000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:40.3127095Z W0507 20:32:40.306000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.3128056Z W0507 20:32:40.306000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.3128839Z W0507 20:32:40.306000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^ 2025-05-07T20:32:40.3129914Z W0507 20:32:40.306000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:32:40.6612660Z self =
T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1295c8f240>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
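This is the eager path: with compiled=False, silu_mul_quant launches the fused _fbgemm_silu_mul_quant kernel directly, so the failure no longer involves torch.compile at all. The math the kernel fuses is the test's own ref_fn — SiLU(x0) * x1 in fp32, then row-wise FP8 quantization. A small unfused sketch, reusing the quantization sketch above (names hypothetical):

    # Unfused sketch of the fused kernel's math (assumed equivalent):
    # SiLU(x0) * x1 in fp32, then row-wise FP8 quantization.
    from typing import Optional, Tuple
    import torch

    def silu_mul_quant_sketch(
        x0: torch.Tensor, x1: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        y = x0.float() * torch.sigmoid(x0.float()) * x1.float()  # SiLU(x0) * x1
        return quantize_fp8_row_sketch(y, scale_ub)  # from the earlier sketch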
2025-05-07T20:32:40.6653787Z x1 = x[:, D:] 2025-05-07T20:32:40.6653988Z 2025-05-07T20:32:40.6654177Z if contiguous: 2025-05-07T20:32:40.6654458Z x0 = x0.contiguous() 2025-05-07T20:32:40.6654717Z x1 = x1.contiguous() 2025-05-07T20:32:40.6654966Z 2025-05-07T20:32:40.6655165Z if scale_ub is not None: 2025-05-07T20:32:40.6655443Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.6655787Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.6656110Z ) 2025-05-07T20:32:40.6656318Z else: 2025-05-07T20:32:40.6656542Z scale_ub_tensor = None 2025-05-07T20:32:40.6656813Z 2025-05-07T20:32:40.6657053Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.6657366Z op = silu_mul_quant 2025-05-07T20:32:40.6657618Z if compiled: 2025-05-07T20:32:40.6657871Z op = torch.compile(op) 2025-05-07T20:32:40.6658168Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.6658456Z 2025-05-07T20:32:40.6658652Z y_fp8, y_scale = fn() 2025-05-07T20:32:40.6658940Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:40.6659255Z 2025-05-07T20:32:40.6659506Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.6659855Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:40.6660164Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:40.6660501Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:40.6660927Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:40.6661248Z 2025-05-07T20:32:40.6661459Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:40.6661657Z 2025-05-07T20:32:40.6661767Z moe/activation_test.py:126: 2025-05-07T20:32:40.6662069Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.6662419Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:40.6662762Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:40.6663580Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:40.6664385Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:40.6665052Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.6665783Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.6666509Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:40.6667359Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:40.6668137Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:40.6668819Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:40.6669457Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:40.6670021Z fn() 2025-05-07T20:32:40.6670570Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:40.6671218Z self.fn.run( 2025-05-07T20:32:40.6671742Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.6672318Z kernel = self.compile( 2025-05-07T20:32:40.6672898Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.6673587Z 
module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.6674009Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.6674247Z 2025-05-07T20:32:40.6674471Z self = 2025-05-07T20:32:40.6675616Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.6677043Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f129679a020>} 2025-05-07T20:32:40.6678461Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.6679560Z context = 2025-05-07T20:32:40.6679860Z 2025-05-07T20:32:40.6680042Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.6680582Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.6681126Z module_map=module_map) 2025-05-07T20:32:40.6681510Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.6681883Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:40.6682163Z E ^ 2025-05-07T20:32:40.6682647Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.6683117Z 2025-05-07T20:32:40.6683556Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.6684101Z 2025-05-07T20:32:40.6684213Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.6684635Z self=, 2025-05-07T20:32:40.6685052Z T=16384, 2025-05-07T20:32:40.6685252Z D=7168, 2025-05-07T20:32:40.6685441Z scale_ub=1200.0, 2025-05-07T20:32:40.6685663Z contiguous=False, 2025-05-07T20:32:40.6685886Z compiled=False, 2025-05-07T20:32:40.6686086Z ) 2025-05-07T20:32:40.8598907Z W0507 20:32:40.856000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:40.8600223Z W0507 20:32:40.856000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Traceback (most recent call last): 2025-05-07T20:32:40.8601684Z W0507 20:32:40.856000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:40.8603312Z W0507 20:32:40.856000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:40.8604333Z W0507 20:32:40.856000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:40.8605713Z W0507 20:32:40.856000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:40.8607177Z W0507 20:32:40.856000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.8608216Z W0507 20:32:40.856000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] 
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:40.8609514Z W0507 20:32:40.856000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:40.8610966Z W0507 20:32:40.856000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.8612097Z W0507 20:32:40.856000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:40.8613465Z W0507 20:32:40.856000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:40.8614910Z W0507 20:32:40.856000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] generator.visit(fn.parse()) 2025-05-07T20:32:40.8616209Z W0507 20:32:40.856000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:40.8617499Z W0507 20:32:40.856000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ret = super().visit(node) 2025-05-07T20:32:40.8618378Z W0507 20:32:40.856000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:40.8619458Z W0507 20:32:40.856000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:40.8620541Z W0507 20:32:40.856000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] return visitor(node) 2025-05-07T20:32:40.8621421Z W0507 20:32:40.856000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^^^^^^^^^^^^^ 2025-05-07T20:32:40.8622702Z W0507 20:32:40.856000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:40.8624207Z W0507 20:32:40.856000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:40.8625609Z W0507 20:32:40.856000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:40.8626892Z W0507 20:32:40.856000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] self.visit(item) 2025-05-07T20:32:40.8628133Z W0507 20:32:40.856000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:40.8629562Z W0507 20:32:40.856000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:40.8630780Z 
W0507 20:32:40.856000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.8632002Z W0507 20:32:40.856000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.8632886Z W0507 20:32:40.856000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^ 2025-05-07T20:32:40.8641320Z W0507 20:32:40.856000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
[identify_mutated_tensors warning with byte-identical traceback repeated three more times at 20:32:40.916, 20:32:41.108, and 20:32:41.117; condensed here.]
2025-05-07T20:32:41.8738810Z self = 2025-05-07T20:32:41.8739606Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False
[test body identical to the T=2048, compiled=False example above; fn() fails at moe/activation_test.py:117 with the same CompilationError in _fbgemm_silu_mul_quant]
2025-05-07T20:32:41.8769736Z Trying example: test_silu_mul_quant( self=, T=1, D=7168, scale_ub=None, contiguous=True, compiled=True )
[test body identical to the T=2048, compiled=True example above; ref_fn() fails at moe/activation_test.py:126 with the same CompilationError in _kernel_quantize_fp8_row]
2025-05-07T20:32:41.8810332Z Trying example: test_silu_mul_quant( self=, T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False )
[identify_mutated_tensors warning with byte-identical traceback repeated at 20:32:42.174, 20:32:42.379, and 20:32:42.672; condensed here. The 20:32:42.682 occurrence is kept below.]
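[editor's note] The W0507 identify_mutated_tensors warnings come from torch.compile's handling of user-defined Triton kernels: torch._higher_order_ops.triton_kernel_wrap first compiles each kernel to TTIR to work out which arguments it mutates, and when that compilation raises, as it does here, it logs the traceback and conservatively assumes every input is mutated; the same CompilationError then surfaces again when the kernel is actually launched. The underlying Triton rejection is reproducible without FBGEMM; a hedged sketch under the assumptions above (kernel body and names are illustrative):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _store_fp8(x_ptr, y_ptr):
        # The fp8e4nv dtype is rejected while this kernel is lowered to
        # TTIR on SM < 8.9, raising the ValueError seen throughout this log.
        tl.store(y_ptr, tl.load(x_ptr).to(tl.float8e4nv))

    x = torch.randn(16, device="cuda")
    y = torch.empty(16, device="cuda", dtype=torch.float8_e4m3fn)
    _store_fp8[(1,)](x, y)  # CompilationError on an A10G (SM 8.6)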
site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:42.3847536Z W0507 20:32:42.379000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ret = super().visit(node) 2025-05-07T20:32:42.3848406Z W0507 20:32:42.379000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:42.3849480Z W0507 20:32:42.379000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:42.3850563Z W0507 20:32:42.379000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] return visitor(node) 2025-05-07T20:32:42.3851397Z W0507 20:32:42.379000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^ 2025-05-07T20:32:42.3852675Z W0507 20:32:42.379000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:42.3854027Z W0507 20:32:42.379000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:42.3855364Z W0507 20:32:42.379000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:42.3856460Z W0507 20:32:42.379000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] self.visit(item) 2025-05-07T20:32:42.3857788Z W0507 20:32:42.379000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:42.3859326Z W0507 20:32:42.379000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:42.3860433Z W0507 20:32:42.379000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.3861439Z W0507 20:32:42.379000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.3862212Z W0507 20:32:42.379000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^ 2025-05-07T20:32:42.3863286Z W0507 20:32:42.379000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.6756921Z W0507 20:32:42.672000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:42.6759200Z W0507 20:32:42.672000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Traceback (most recent call last): 2025-05-07T20:32:42.6761567Z W0507 20:32:42.672000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:42.6763088Z W0507 20:32:42.672000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:42.6764129Z W0507 20:32:42.672000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:42.6765498Z W0507 20:32:42.672000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:42.6766963Z W0507 20:32:42.672000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.6767997Z W0507 20:32:42.672000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:42.6769299Z W0507 20:32:42.672000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:42.6770754Z W0507 20:32:42.672000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.6771947Z W0507 20:32:42.672000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:42.6773298Z W0507 20:32:42.672000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:42.6774896Z W0507 20:32:42.672000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] generator.visit(fn.parse()) 2025-05-07T20:32:42.6776520Z W0507 20:32:42.672000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:42.6777803Z W0507 20:32:42.672000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ret = super().visit(node) 2025-05-07T20:32:42.6778833Z W0507 20:32:42.672000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:42.6779911Z W0507 20:32:42.672000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:42.6780981Z W0507 20:32:42.672000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] return visitor(node) 2025-05-07T20:32:42.6781812Z W0507 20:32:42.672000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^^^^^^^^^^^^^ 2025-05-07T20:32:42.6783145Z W0507 20:32:42.672000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:42.6784700Z W0507 20:32:42.672000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:42.6785882Z W0507 20:32:42.672000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:42.6795378Z W0507 20:32:42.672000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] self.visit(item) 2025-05-07T20:32:42.6796665Z W0507 20:32:42.672000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:42.6798116Z W0507 20:32:42.672000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:42.6799248Z W0507 20:32:42.672000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.6800200Z W0507 20:32:42.672000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.6800985Z W0507 20:32:42.672000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^ 2025-05-07T20:32:42.6802112Z W0507 20:32:42.672000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.0506927Z self = 2025-05-07T20:32:44.0507548Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:44.0507851Z 2025-05-07T20:32:44.0507945Z @given( 2025-05-07T20:32:44.0508189Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.0508522Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.0508856Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.0509205Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.0509545Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.0509852Z ) 2025-05-07T20:32:44.0510217Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.0510675Z def test_silu_mul_quant( 2025-05-07T20:32:44.0510926Z self, 2025-05-07T20:32:44.0511131Z T: int, 2025-05-07T20:32:44.0511328Z D: int, 2025-05-07T20:32:44.0511553Z scale_ub: Optional[float], 2025-05-07T20:32:44.0511835Z contiguous: bool, 2025-05-07T20:32:44.0512086Z compiled: bool, 2025-05-07T20:32:44.0512325Z ) -> None: 2025-05-07T20:32:44.0512549Z torch.manual_seed(2025) 2025-05-07T20:32:44.0512797Z 2025-05-07T20:32:44.0513093Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.0513467Z 2025-05-07T20:32:44.0513659Z x_sign = torch.sign(x) 2025-05-07T20:32:44.0513963Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.0514283Z x = x_sign * x_clamp 2025-05-07T20:32:44.0514533Z x0 = x[:, :D] 2025-05-07T20:32:44.0514753Z x1 = x[:, D:] 2025-05-07T20:32:44.0514972Z 2025-05-07T20:32:44.0515170Z if contiguous: 2025-05-07T20:32:44.0515403Z x0 = x0.contiguous() 2025-05-07T20:32:44.0515670Z x1 = x1.contiguous() 2025-05-07T20:32:44.0515919Z 2025-05-07T20:32:44.0516108Z if scale_ub is not None: 2025-05-07T20:32:44.0516390Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.0516743Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.0517062Z ) 2025-05-07T20:32:44.0517264Z else: 2025-05-07T20:32:44.0517485Z scale_ub_tensor = None 2025-05-07T20:32:44.0517740Z 2025-05-07T20:32:44.0517982Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.0518323Z op = silu_mul_quant 2025-05-07T20:32:44.0518571Z if compiled: 2025-05-07T20:32:44.0518832Z op = torch.compile(op) 2025-05-07T20:32:44.0519142Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.0519436Z 2025-05-07T20:32:44.0519627Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.0519804Z 2025-05-07T20:32:44.0519909Z moe/activation_test.py:117: 2025-05-07T20:32:44.0520231Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.0520570Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.0520868Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.0522007Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.0522752Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.0523328Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.0524270Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.0525003Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.0525934Z kernel = self.compile( 2025-05-07T20:32:44.0526531Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.0527252Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.0527693Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.0527941Z 2025-05-07T20:32:44.0528171Z self = 2025-05-07T20:32:44.0529321Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.0530796Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f127b0a8180>} 2025-05-07T20:32:44.0532232Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.0533331Z context = 2025-05-07T20:32:44.0533647Z 2025-05-07T20:32:44.0533826Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.0534495Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.0535028Z module_map=module_map) 2025-05-07T20:32:44.0535410Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.0535796Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.0536079Z E ^ 2025-05-07T20:32:44.0536569Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.0537058Z 2025-05-07T20:32:44.0537503Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.0538063Z 2025-05-07T20:32:44.0538172Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.0538612Z self=, 2025-05-07T20:32:44.0539041Z T=4096, 2025-05-07T20:32:44.0539254Z D=7168, 2025-05-07T20:32:44.0539452Z scale_ub=None, 2025-05-07T20:32:44.0539667Z contiguous=False, 2025-05-07T20:32:44.0539905Z compiled=False, 2025-05-07T20:32:44.0540118Z ) 2025-05-07T20:32:44.0540442Z self = 2025-05-07T20:32:44.0540968Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:44.0541265Z 2025-05-07T20:32:44.0541347Z @given( 2025-05-07T20:32:44.0541592Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.0541917Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.0542240Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.0542587Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.0542930Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.0543234Z ) 2025-05-07T20:32:44.0543600Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.0544268Z def test_silu_mul_quant( 2025-05-07T20:32:44.0544526Z self, 2025-05-07T20:32:44.0544739Z T: int, 2025-05-07T20:32:44.0544971Z D: int, 2025-05-07T20:32:44.0545199Z scale_ub: Optional[float], 2025-05-07T20:32:44.0545626Z contiguous: bool, 2025-05-07T20:32:44.0545885Z compiled: bool, 2025-05-07T20:32:44.0546112Z ) -> None: 2025-05-07T20:32:44.0546336Z torch.manual_seed(2025) 2025-05-07T20:32:44.0546584Z 2025-05-07T20:32:44.0546872Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.0547227Z 2025-05-07T20:32:44.0547427Z x_sign = torch.sign(x) 2025-05-07T20:32:44.0547729Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.0548046Z x = x_sign * x_clamp 2025-05-07T20:32:44.0548299Z x0 = x[:, :D] 
2025-05-07T20:32:44.0548525Z x1 = x[:, D:] 2025-05-07T20:32:44.0548733Z 2025-05-07T20:32:44.0548925Z if contiguous: 2025-05-07T20:32:44.0549170Z x0 = x0.contiguous() 2025-05-07T20:32:44.0549434Z x1 = x1.contiguous() 2025-05-07T20:32:44.0549686Z 2025-05-07T20:32:44.0549890Z if scale_ub is not None: 2025-05-07T20:32:44.0550165Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.0550526Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.0550853Z ) 2025-05-07T20:32:44.0551052Z else: 2025-05-07T20:32:44.0551277Z scale_ub_tensor = None 2025-05-07T20:32:44.0551577Z 2025-05-07T20:32:44.0551844Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.0552166Z op = silu_mul_quant 2025-05-07T20:32:44.0552427Z if compiled: 2025-05-07T20:32:44.0552687Z op = torch.compile(op) 2025-05-07T20:32:44.0552989Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.0553283Z 2025-05-07T20:32:44.0553481Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.0553648Z 2025-05-07T20:32:44.0553752Z moe/activation_test.py:117: 2025-05-07T20:32:44.0554067Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.0554414Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.0554704Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.0555442Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.0556307Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.0556884Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.0557697Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.0558487Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.0559132Z kernel = self.compile( 2025-05-07T20:32:44.0559719Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.0560414Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.0560832Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.0561081Z 2025-05-07T20:32:44.0561302Z self = 2025-05-07T20:32:44.0562426Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.0563875Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f127b0a9080>} 2025-05-07T20:32:44.0566261Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.0567364Z context = 2025-05-07T20:32:44.0567752Z 2025-05-07T20:32:44.0567921Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.0568460Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.0568946Z module_map=module_map) 2025-05-07T20:32:44.0569315Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.0569676Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.0569940Z E ^ 2025-05-07T20:32:44.0570413Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.0570891Z 2025-05-07T20:32:44.0571338Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.0571892Z 2025-05-07T20:32:44.0571996Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.0572422Z self=, 2025-05-07T20:32:44.0572837Z T=128, 2025-05-07T20:32:44.0573031Z D=7168, 2025-05-07T20:32:44.0573224Z scale_ub=None, 2025-05-07T20:32:44.0573437Z contiguous=False, 2025-05-07T20:32:44.0573663Z compiled=True, 2025-05-07T20:32:44.0573872Z ) 2025-05-07T20:32:44.1149753Z self = 2025-05-07T20:32:44.1150302Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:44.1150587Z 2025-05-07T20:32:44.1150670Z @given( 2025-05-07T20:32:44.1150910Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.1151237Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.1151598Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.1151974Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.1152315Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.1152617Z ) 2025-05-07T20:32:44.1152975Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.1153450Z def test_silu_mul_quant( 2025-05-07T20:32:44.1153701Z self, 2025-05-07T20:32:44.1153898Z T: int, 2025-05-07T20:32:44.1154099Z D: int, 2025-05-07T20:32:44.1154326Z scale_ub: Optional[float], 2025-05-07T20:32:44.1154603Z contiguous: bool, 2025-05-07T20:32:44.1154852Z compiled: bool, 2025-05-07T20:32:44.1155087Z ) -> None: 2025-05-07T20:32:44.1155300Z torch.manual_seed(2025) 2025-05-07T20:32:44.1155552Z 2025-05-07T20:32:44.1155842Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.1156195Z 2025-05-07T20:32:44.1156394Z x_sign = torch.sign(x) 2025-05-07T20:32:44.1156700Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.1157017Z x = x_sign * x_clamp 2025-05-07T20:32:44.1157264Z x0 = x[:, :D] 2025-05-07T20:32:44.1157489Z x1 = x[:, D:] 2025-05-07T20:32:44.1157706Z 2025-05-07T20:32:44.1157890Z if contiguous: 2025-05-07T20:32:44.1158134Z x0 = x0.contiguous() 2025-05-07T20:32:44.1158405Z x1 = x1.contiguous() 2025-05-07T20:32:44.1158652Z 2025-05-07T20:32:44.1158853Z if scale_ub is not None: 2025-05-07T20:32:44.1159137Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.1159475Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.1159797Z ) 2025-05-07T20:32:44.1159997Z else: 2025-05-07T20:32:44.1160210Z scale_ub_tensor = None 2025-05-07T20:32:44.1160473Z 2025-05-07T20:32:44.1160710Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.1161265Z op = silu_mul_quant 2025-05-07T20:32:44.1161528Z if compiled: 2025-05-07T20:32:44.1161806Z op = torch.compile(op) 2025-05-07T20:32:44.1162127Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.1162411Z 2025-05-07T20:32:44.1162765Z y_fp8, y_scale = fn() 2025-05-07T20:32:44.1163057Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:44.1163355Z 2025-05-07T20:32:44.1163598Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.1163945Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:44.1164239Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:44.1164564Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:44.1164935Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:44.1165251Z 2025-05-07T20:32:44.1165458Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:44.1165683Z 2025-05-07T20:32:44.1165796Z moe/activation_test.py:126: 2025-05-07T20:32:44.1166105Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.1166455Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:44.1166797Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:44.1167655Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:44.1168587Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:44.1169158Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.1169964Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.1170741Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:44.1171609Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:44.1172393Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:44.1173073Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:44.1173721Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:44.1174268Z fn() 2025-05-07T20:32:44.1174936Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:44.1175556Z self.fn.run( 2025-05-07T20:32:44.1176038Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.1176600Z kernel = self.compile( 2025-05-07T20:32:44.1177168Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.1177862Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.1178266Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.1178510Z 2025-05-07T20:32:44.1178723Z self = 2025-05-07T20:32:44.1179854Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.1181294Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f127b0a9f80>} 2025-05-07T20:32:44.1182703Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.1183890Z context = 2025-05-07T20:32:44.1184204Z 2025-05-07T20:32:44.1184379Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.1184933Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.1185528Z module_map=module_map) 2025-05-07T20:32:44.1185910Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.1186289Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:44.1186574Z E ^ 2025-05-07T20:32:44.1187057Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.1187539Z 2025-05-07T20:32:44.1187979Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.1188524Z 2025-05-07T20:32:44.1188646Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.1189080Z self=, 2025-05-07T20:32:44.1189500Z T=128, 2025-05-07T20:32:44.1189700Z D=7168, 2025-05-07T20:32:44.1189900Z scale_ub=None, 2025-05-07T20:32:44.1190122Z contiguous=False, 2025-05-07T20:32:44.1190361Z compiled=False, 2025-05-07T20:32:44.1190576Z ) 2025-05-07T20:32:44.3138811Z self = 2025-05-07T20:32:44.3139381Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:44.3139663Z 2025-05-07T20:32:44.3139749Z @given( 2025-05-07T20:32:44.3139985Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3140303Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3140606Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3140940Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3141304Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3141591Z ) 2025-05-07T20:32:44.3141944Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3142400Z def test_silu_mul_quant( 2025-05-07T20:32:44.3142652Z self, 2025-05-07T20:32:44.3142842Z T: int, 2025-05-07T20:32:44.3143039Z D: int, 2025-05-07T20:32:44.3143259Z scale_ub: Optional[float], 2025-05-07T20:32:44.3143527Z contiguous: bool, 2025-05-07T20:32:44.3143768Z compiled: bool, 2025-05-07T20:32:44.3144001Z ) -> None: 2025-05-07T20:32:44.3144219Z torch.manual_seed(2025) 2025-05-07T20:32:44.3144466Z 2025-05-07T20:32:44.3144744Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3145090Z 2025-05-07T20:32:44.3145286Z x_sign = torch.sign(x) 2025-05-07T20:32:44.3145579Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.3145885Z x = x_sign * x_clamp 2025-05-07T20:32:44.3146132Z x0 = x[:, :D] 2025-05-07T20:32:44.3146348Z x1 = x[:, D:] 2025-05-07T20:32:44.3146546Z 2025-05-07T20:32:44.3146733Z if contiguous: 2025-05-07T20:32:44.3146971Z x0 = x0.contiguous() 2025-05-07T20:32:44.3147235Z x1 = x1.contiguous() 2025-05-07T20:32:44.3147473Z 2025-05-07T20:32:44.3147665Z if scale_ub is not None: 2025-05-07T20:32:44.3147940Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.3148271Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.3148584Z ) 2025-05-07T20:32:44.3148773Z else: 2025-05-07T20:32:44.3148977Z scale_ub_tensor = None 2025-05-07T20:32:44.3149231Z 2025-05-07T20:32:44.3149457Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.3149767Z op = silu_mul_quant 2025-05-07T20:32:44.3150017Z if compiled: 2025-05-07T20:32:44.3150266Z op = torch.compile(op) 2025-05-07T20:32:44.3150907Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3151199Z 2025-05-07T20:32:44.3151398Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.3151569Z 2025-05-07T20:32:44.3151693Z moe/activation_test.py:117: 2025-05-07T20:32:44.3152165Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3152512Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.3152800Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3153520Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.3154251Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.3154812Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.3155528Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.3156233Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.3156802Z kernel = self.compile( 2025-05-07T20:32:44.3157366Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.3158054Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.3158476Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3158720Z 2025-05-07T20:32:44.3158934Z self = 2025-05-07T20:32:44.3160064Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.3161515Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f127ab52340>} 2025-05-07T20:32:44.3162935Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.3164023Z context = 2025-05-07T20:32:44.3164330Z 2025-05-07T20:32:44.3164501Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.3165048Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.3165540Z module_map=module_map) 2025-05-07T20:32:44.3165910Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.3166274Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.3175723Z E ^ 2025-05-07T20:32:44.3176266Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.3176754Z 2025-05-07T20:32:44.3177199Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.3177757Z 2025-05-07T20:32:44.3177876Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3178305Z self=, 2025-05-07T20:32:44.3178735Z T=4096, 2025-05-07T20:32:44.3178939Z D=5120, 2025-05-07T20:32:44.3179145Z scale_ub=1200.0, 2025-05-07T20:32:44.3179371Z contiguous=True, 2025-05-07T20:32:44.3179602Z compiled=False, 2025-05-07T20:32:44.3179823Z ) 2025-05-07T20:32:44.3180150Z self = 2025-05-07T20:32:44.3180668Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:44.3180955Z 2025-05-07T20:32:44.3181045Z @given( 2025-05-07T20:32:44.3181393Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:44.3181728Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:44.3182101Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:44.3182437Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:44.3182865Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:44.3183168Z ) 2025-05-07T20:32:44.3183531Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:44.3183991Z def test_silu_mul_quant( 2025-05-07T20:32:44.3184246Z self, 2025-05-07T20:32:44.3184456Z T: int, 2025-05-07T20:32:44.3184650Z D: int, 2025-05-07T20:32:44.3184874Z scale_ub: Optional[float], 2025-05-07T20:32:44.3185153Z contiguous: bool, 2025-05-07T20:32:44.3185389Z compiled: bool, 2025-05-07T20:32:44.3185625Z ) -> None: 2025-05-07T20:32:44.3185844Z torch.manual_seed(2025) 2025-05-07T20:32:44.3186086Z 2025-05-07T20:32:44.3186373Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:44.3186734Z 2025-05-07T20:32:44.3186925Z x_sign = torch.sign(x) 2025-05-07T20:32:44.3187226Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:44.3187552Z x = x_sign * x_clamp 2025-05-07T20:32:44.3187791Z x0 = x[:, :D] 2025-05-07T20:32:44.3188008Z x1 = x[:, D:] 2025-05-07T20:32:44.3188222Z 2025-05-07T20:32:44.3188406Z if contiguous: 2025-05-07T20:32:44.3188645Z x0 = x0.contiguous() 2025-05-07T20:32:44.3188910Z x1 = x1.contiguous() 2025-05-07T20:32:44.3189164Z 2025-05-07T20:32:44.3189355Z if scale_ub is not None: 2025-05-07T20:32:44.3189641Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:44.3189987Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:44.3190302Z ) 2025-05-07T20:32:44.3190499Z else: 2025-05-07T20:32:44.3190719Z scale_ub_tensor = None 2025-05-07T20:32:44.3190971Z 2025-05-07T20:32:44.3191209Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:44.3191533Z op = silu_mul_quant 2025-05-07T20:32:44.3191781Z if compiled: 2025-05-07T20:32:44.3192044Z op = torch.compile(op) 2025-05-07T20:32:44.3192351Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3192631Z 2025-05-07T20:32:44.3192828Z > y_fp8, y_scale = fn() 2025-05-07T20:32:44.3193003Z 2025-05-07T20:32:44.3193102Z moe/activation_test.py:117: 2025-05-07T20:32:44.3193408Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3193751Z moe/activation_test.py:115: in fn 2025-05-07T20:32:44.3194044Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:44.3194770Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:44.3195501Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:44.3196061Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:44.3196780Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:44.3197483Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:44.3198037Z kernel = self.compile( 2025-05-07T20:32:44.3198607Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:44.3199299Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.3199704Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:44.3199951Z 2025-05-07T20:32:44.3200162Z self = 2025-05-07T20:32:44.3201376Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:44.3202867Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f127ab51440>} 2025-05-07T20:32:44.3204361Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:44.3205448Z context = 2025-05-07T20:32:44.3205756Z 2025-05-07T20:32:44.3205928Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:44.3206482Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.3206972Z module_map=module_map) 2025-05-07T20:32:44.3207346Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.3207717Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.3207996Z E ^ 2025-05-07T20:32:44.3208602Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:44.3209132Z 2025-05-07T20:32:44.3209647Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:44.3210208Z 2025-05-07T20:32:44.3210318Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:44.3210815Z self=, 2025-05-07T20:32:44.3211301Z T=1, 2025-05-07T20:32:44.3211503Z D=5120, 2025-05-07T20:32:44.3211707Z scale_ub=None, 2025-05-07T20:32:44.3211933Z contiguous=True, 2025-05-07T20:32:44.3212178Z compiled=True, 2025-05-07T20:32:44.3212391Z ) 2025-05-07T20:32:44.5858919Z W0507 20:32:44.582000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:44.5860206Z W0507 20:32:44.582000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Traceback (most recent call last): 2025-05-07T20:32:44.5861812Z W0507 20:32:44.582000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:44.5863377Z W0507 20:32:44.582000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:44.5864428Z W0507 20:32:44.582000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:44.5865826Z W0507 20:32:44.582000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:44.5867304Z W0507 20:32:44.582000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:44.5868353Z W0507 20:32:44.582000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:44.5870036Z W0507 20:32:44.582000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:44.5871513Z W0507 20:32:44.582000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:44.5872643Z W0507 20:32:44.582000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:44.5874140Z W0507 20:32:44.582000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:44.5875472Z W0507 20:32:44.582000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] generator.visit(fn.parse()) 2025-05-07T20:32:44.5876780Z W0507 20:32:44.582000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:44.5878068Z W0507 20:32:44.582000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ret = super().visit(node) 2025-05-07T20:32:44.5878951Z W0507 20:32:44.582000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:44.5880029Z W0507 20:32:44.582000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:44.5881108Z W0507 20:32:44.582000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] return visitor(node) 2025-05-07T20:32:44.5882003Z W0507 20:32:44.582000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^^^^^^^^^^^^^ 2025-05-07T20:32:44.5883300Z W0507 20:32:44.582000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:44.5884664Z W0507 20:32:44.582000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:44.5885844Z W0507 20:32:44.582000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:44.5886962Z W0507 20:32:44.582000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] self.visit(item) 2025-05-07T20:32:44.5888224Z W0507 20:32:44.582000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:44.5889662Z W0507 20:32:44.582000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:44.5890771Z W0507 20:32:44.582000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:44.5891738Z W0507 20:32:44.582000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:44.5892517Z W0507 20:32:44.582000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^ 2025-05-07T20:32:44.5893591Z W0507 20:32:44.582000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:32:45.0941881Z self = 
T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = 
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f127ab53060>}
module_map = {'triton.language.extra.libdevice': }
context = 

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
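Every example hypothesis tries fails at this same spot: Triton refuses to lower the kernel because its output dtype is fp8e4nv (PyTorch's float8_e4m3fn), which NVIDIA GPUs implement only from compute capability 8.9 (Ada/Hopper) onward; per the error text, this device exposes only fp8e4b15 and fp8e5. Below is a minimal sketch of a capability guard that would skip the test on such GPUs; the helper name and decorator placement are illustrative assumptions, not FBGEMM's actual test scaffolding.

import unittest

import torch


def device_supports_fp8e4nv() -> bool:
    # fp8e4nv (torch.float8_e4m3fn) needs sm_89 or newer; older CUDA GPUs
    # only expose Triton's fp8e4b15/fp8e5 encodings, hence the ValueError above.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipIf(not device_supports_fp8e4nv(), "fp8e4nv requires sm_89 or newer")
class TestSiluMulQuant(unittest.TestCase):
    ...  # test_silu_mul_quant as shown in the failure above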

2025-05-07T20:32:45.0980819Z Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
2025-05-07T20:32:45.3370585Z W0507 20:32:45.334000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:32:45.4070457Z W0507 20:32:45.404000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:32:45.6116792Z W0507 20:32:45.608000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:32:45.6217406Z W0507 20:32:45.618000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:32:45.8306683Z self = 
T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True

    [... test_silu_mul_quant body identical to the T = 1 failure above ...]

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
    [... Triton runtime, autotuner, and compiler frames identical to the first failure ...]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
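The failing frame is in the reference path: ref_fn calls triton_quantize_fp8_row directly, so the error reproduces without hypothesis or torch.compile in the loop. A hypothetical standalone repro, assuming fbgemm_gpu's experimental GEMM package is importable from the path shown in the stack:

import torch

from fbgemm_gpu.experimental.gemm.triton_gemm.fp8_gemm import triton_quantize_fp8_row

# Row-wise FP8 quantization of an arbitrary float tensor, as in ref_fn above.
y = torch.randn(128, 5120, device="cuda", dtype=torch.float32)
# On a pre-sm_89 GPU this raises the same CompilationError seen in this log;
# on sm_89+ it returns the FP8 rows and their per-row scales.
y_fp8, y_scale = triton_quantize_fp8_row(y, None)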

2025-05-07T20:32:45.8346006Z Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
2025-05-07T20:32:46.0803373Z W0507 20:32:46.077000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:32:46.1510009Z W0507 20:32:46.148000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:32:46.3579313Z W0507 20:32:46.354000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:32:46.3680013Z W0507 20:32:46.365000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:46.6149554Z self = 2025-05-07T20:32:46.6150180Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:46.6150574Z 2025-05-07T20:32:46.6150847Z @given( 2025-05-07T20:32:46.6151090Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:46.6151424Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:46.6151752Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:46.6152098Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:46.6152471Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:46.6152767Z ) 2025-05-07T20:32:46.6153136Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:46.6153612Z def test_silu_mul_quant( 2025-05-07T20:32:46.6153862Z self, 2025-05-07T20:32:46.6154068Z T: int, 2025-05-07T20:32:46.6154277Z D: int, 2025-05-07T20:32:46.6154508Z scale_ub: Optional[float], 2025-05-07T20:32:46.6154798Z contiguous: bool, 2025-05-07T20:32:46.6155055Z compiled: bool, 2025-05-07T20:32:46.6155285Z ) -> None: 2025-05-07T20:32:46.6155504Z torch.manual_seed(2025) 2025-05-07T20:32:46.6155760Z 2025-05-07T20:32:46.6156034Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:46.6156394Z 2025-05-07T20:32:46.6156590Z x_sign = torch.sign(x) 2025-05-07T20:32:46.6156882Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:46.6157205Z x = x_sign * x_clamp 2025-05-07T20:32:46.6157454Z x0 = x[:, :D] 2025-05-07T20:32:46.6157672Z x1 = x[:, D:] 2025-05-07T20:32:46.6157880Z 2025-05-07T20:32:46.6158067Z if contiguous: 2025-05-07T20:32:46.6158302Z x0 = x0.contiguous() 2025-05-07T20:32:46.6158563Z x1 = x1.contiguous() 2025-05-07T20:32:46.6158812Z 2025-05-07T20:32:46.6159008Z if scale_ub is not None: 2025-05-07T20:32:46.6159279Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:46.6159619Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:46.6159942Z ) 2025-05-07T20:32:46.6160144Z else: 2025-05-07T20:32:46.6160361Z scale_ub_tensor = None 2025-05-07T20:32:46.6160629Z 2025-05-07T20:32:46.6160864Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:46.6161183Z op = silu_mul_quant 2025-05-07T20:32:46.6161437Z if compiled: 2025-05-07T20:32:46.6161680Z op = torch.compile(op) 2025-05-07T20:32:46.6161988Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:46.6162272Z 2025-05-07T20:32:46.6162465Z y_fp8, y_scale = fn() 2025-05-07T20:32:46.6162754Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:46.6163053Z 2025-05-07T20:32:46.6163292Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:46.6163644Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:46.6163949Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:46.6164269Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:46.6164634Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:46.6164968Z 2025-05-07T20:32:46.6165174Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:46.6165375Z 2025-05-07T20:32:46.6165489Z moe/activation_test.py:126: 2025-05-07T20:32:46.6165786Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:46.6166134Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:46.6166477Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:46.6167299Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in 
triton_quantize_fp8_row 2025-05-07T20:32:46.6168084Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:46.6168780Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:46.6169506Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:46.6170311Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:46.6171068Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:46.6171834Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:46.6180748Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:46.6181437Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:46.6181986Z fn() 2025-05-07T20:32:46.6182590Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:46.6183215Z self.fn.run( 2025-05-07T20:32:46.6183725Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:46.6184301Z kernel = self.compile( 2025-05-07T20:32:46.6184875Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:46.6185573Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:46.6185996Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:46.6186250Z 2025-05-07T20:32:46.6186465Z self = 2025-05-07T20:32:46.6187602Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:46.6189052Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f127a6a5f80>} 2025-05-07T20:32:46.6190482Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:46.6191569Z context = 2025-05-07T20:32:46.6191879Z 2025-05-07T20:32:46.6192051Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:46.6192599Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:46.6193096Z module_map=module_map) 2025-05-07T20:32:46.6193471Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:46.6193853Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:46.6194140Z E ^ 2025-05-07T20:32:46.6194623Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:46.6195108Z 2025-05-07T20:32:46.6195555Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:46.6196117Z 2025-05-07T20:32:46.6196224Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:46.6196659Z self=, 2025-05-07T20:32:46.6197084Z T=4096, 2025-05-07T20:32:46.6197288Z D=5120, 2025-05-07T20:32:46.6197502Z scale_ub=None, 2025-05-07T20:32:46.6197720Z contiguous=True, 2025-05-07T20:32:46.6197963Z compiled=True, 2025-05-07T20:32:46.6198182Z ) 2025-05-07T20:32:46.8692127Z W0507 20:32:46.866000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:46.8693292Z W0507 20:32:46.866000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] Traceback (most recent call last): 2025-05-07T20:32:46.8694776Z W0507 20:32:46.866000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:46.8696400Z W0507 20:32:46.866000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:46.8697424Z W0507 20:32:46.866000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:46.8698805Z W0507 20:32:46.866000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:46.8700253Z W0507 20:32:46.866000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:46.8701289Z W0507 20:32:46.866000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:46.8702578Z W0507 20:32:46.866000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:46.8704033Z W0507 20:32:46.866000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:46.8705145Z W0507 20:32:46.866000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:46.8706489Z W0507 20:32:46.866000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:46.8707803Z W0507 20:32:46.866000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] generator.visit(fn.parse()) 2025-05-07T20:32:46.8709085Z W0507 20:32:46.866000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:46.8710359Z W0507 20:32:46.866000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ret = super().visit(node) 2025-05-07T20:32:46.8711231Z W0507 20:32:46.866000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:46.8712310Z W0507 20:32:46.866000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:32:46.8713375Z W0507 20:32:46.866000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] return visitor(node) 2025-05-07T20:32:46.8714203Z W0507 20:32:46.866000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^^^^^^^^^^^^^ 2025-05-07T20:32:46.8715558Z W0507 20:32:46.866000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:46.8716919Z W0507 20:32:46.866000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:46.8718093Z W0507 20:32:46.866000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:32:46.8719269Z W0507 20:32:46.866000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] self.visit(item) 2025-05-07T20:32:46.8720511Z W0507 20:32:46.866000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:46.8721951Z W0507 20:32:46.866000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:46.8723109Z W0507 20:32:46.866000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:46.8724069Z W0507 20:32:46.866000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:46.8724851Z W0507 20:32:46.866000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ^ 2025-05-07T20:32:46.8726079Z W0507 20:32:46.866000 95353 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/7] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.4068237Z self = 2025-05-07T20:32:47.4068977Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:47.4069394Z 2025-05-07T20:32:47.4069505Z @given( 2025-05-07T20:32:47.4069835Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.4070250Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.4070581Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.4070932Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.4071285Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.4071587Z ) 2025-05-07T20:32:47.4072120Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.4072587Z def test_silu_mul_quant( 2025-05-07T20:32:47.4072826Z self, 2025-05-07T20:32:47.4073028Z T: int, 2025-05-07T20:32:47.4073232Z D: int, 2025-05-07T20:32:47.4073558Z scale_ub: Optional[float], 2025-05-07T20:32:47.4073845Z contiguous: bool, 2025-05-07T20:32:47.4074093Z compiled: bool, 2025-05-07T20:32:47.4074313Z ) -> None: 2025-05-07T20:32:47.4074531Z torch.manual_seed(2025) 2025-05-07T20:32:47.4074788Z 2025-05-07T20:32:47.4075061Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.4075421Z 2025-05-07T20:32:47.4075619Z x_sign = torch.sign(x) 2025-05-07T20:32:47.4075911Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.4076228Z x = x_sign * x_clamp 2025-05-07T20:32:47.4076472Z x0 = x[:, :D] 2025-05-07T20:32:47.4076679Z x1 = x[:, D:] 2025-05-07T20:32:47.4076899Z 2025-05-07T20:32:47.4077085Z if contiguous: 2025-05-07T20:32:47.4077314Z x0 = x0.contiguous() 2025-05-07T20:32:47.4077574Z x1 = x1.contiguous() 2025-05-07T20:32:47.4077819Z 2025-05-07T20:32:47.4078009Z if scale_ub is not None: 2025-05-07T20:32:47.4078304Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.4078646Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.4078961Z ) 2025-05-07T20:32:47.4079158Z else: 2025-05-07T20:32:47.4079365Z scale_ub_tensor = None 2025-05-07T20:32:47.4079626Z 2025-05-07T20:32:47.4079862Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.4080174Z op = silu_mul_quant 2025-05-07T20:32:47.4080430Z if compiled: 2025-05-07T20:32:47.4080682Z op = torch.compile(op) 2025-05-07T20:32:47.4080979Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.4081265Z 2025-05-07T20:32:47.4081470Z y_fp8, y_scale = fn() 2025-05-07T20:32:47.4081753Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:47.4082056Z 2025-05-07T20:32:47.4082297Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.4082637Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:47.4082942Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:47.4083269Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:47.4083639Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:47.4083951Z 2025-05-07T20:32:47.4084152Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:47.4084353Z 2025-05-07T20:32:47.4084460Z moe/activation_test.py:126: 2025-05-07T20:32:47.4084760Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.4085099Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:47.4085436Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:47.4086253Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in 
triton_quantize_fp8_row 2025-05-07T20:32:47.4087042Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:47.4087626Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.4088341Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.4089061Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:47.4089821Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:47.4090589Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:47.4091381Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:47.4092009Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:47.4092589Z fn() 2025-05-07T20:32:47.4093140Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:47.4093831Z self.fn.run( 2025-05-07T20:32:47.4094320Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.4095043Z kernel = self.compile( 2025-05-07T20:32:47.4095611Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.4096297Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.4096713Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.4096952Z 2025-05-07T20:32:47.4097183Z self = 2025-05-07T20:32:47.4098313Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.4099751Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f127a002520>} 2025-05-07T20:32:47.4101169Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.4102265Z context = 2025-05-07T20:32:47.4102565Z 2025-05-07T20:32:47.4102745Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.4103291Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.4103778Z module_map=module_map) 2025-05-07T20:32:47.4104162Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.4104541Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:47.4104814Z E ^ 2025-05-07T20:32:47.4105296Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.4105771Z 2025-05-07T20:32:47.4106216Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.4106758Z 2025-05-07T20:32:47.4106871Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.4107294Z self=, 2025-05-07T20:32:47.4107718Z T=16384, 2025-05-07T20:32:47.4107921Z D=5120, 2025-05-07T20:32:47.4108111Z scale_ub=None, 2025-05-07T20:32:47.4108332Z contiguous=True, 2025-05-07T20:32:47.4108558Z compiled=True, 2025-05-07T20:32:47.4108755Z ) 2025-05-07T20:32:47.4393957Z W0507 20:32:47.437000 95353 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] torch._dynamo hit config.recompile_limit (8) 2025-05-07T20:32:47.4395266Z W0507 20:32:47.437000 95353 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55) 2025-05-07T20:32:47.4396672Z W0507 20:32:47.437000 95353 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] last reason: 1/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240 2025-05-07T20:32:47.4397709Z W0507 20:32:47.437000 95353 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] To log all recompilation reasons, use TORCH_LOGS="recompiles". 2025-05-07T20:32:47.4399021Z W0507 20:32:47.437000 95353 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html. 2025-05-07T20:32:47.5245759Z self = 2025-05-07T20:32:47.5247019Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:47.5247962Z 2025-05-07T20:32:47.5248141Z @given( 2025-05-07T20:32:47.5248603Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.5249227Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.5249845Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.5250520Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.5251179Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.5251754Z ) 2025-05-07T20:32:47.5252455Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.5253249Z def test_silu_mul_quant( 2025-05-07T20:32:47.5253506Z self, 2025-05-07T20:32:47.5253706Z T: int, 2025-05-07T20:32:47.5253906Z D: int, 2025-05-07T20:32:47.5254127Z scale_ub: Optional[float], 2025-05-07T20:32:47.5254533Z contiguous: bool, 2025-05-07T20:32:47.5254773Z compiled: bool, 2025-05-07T20:32:47.5254996Z ) -> None: 2025-05-07T20:32:47.5255207Z torch.manual_seed(2025) 2025-05-07T20:32:47.5255447Z 2025-05-07T20:32:47.5255715Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.5256067Z 2025-05-07T20:32:47.5256261Z x_sign = torch.sign(x) 2025-05-07T20:32:47.5256541Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.5256853Z x = x_sign * x_clamp 2025-05-07T20:32:47.5257089Z x0 = x[:, :D] 2025-05-07T20:32:47.5257297Z x1 = x[:, D:] 2025-05-07T20:32:47.5257505Z 2025-05-07T20:32:47.5257688Z if contiguous: 2025-05-07T20:32:47.5257913Z x0 = x0.contiguous() 2025-05-07T20:32:47.5258179Z x1 = x1.contiguous() 2025-05-07T20:32:47.5258425Z 2025-05-07T20:32:47.5258609Z if scale_ub is not None: 2025-05-07T20:32:47.5258884Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.5259223Z [scale_ub], device="cuda", dtype=torch.float32 
2025-05-07T20:32:47.5259529Z ) 2025-05-07T20:32:47.5259723Z else: 2025-05-07T20:32:47.5259932Z scale_ub_tensor = None 2025-05-07T20:32:47.5260184Z 2025-05-07T20:32:47.5260419Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.5260734Z op = silu_mul_quant 2025-05-07T20:32:47.5260988Z if compiled: 2025-05-07T20:32:47.5261228Z op = torch.compile(op) 2025-05-07T20:32:47.5261526Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.5261803Z 2025-05-07T20:32:47.5261989Z y_fp8, y_scale = fn() 2025-05-07T20:32:47.5262272Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:47.5262572Z 2025-05-07T20:32:47.5262808Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.5263148Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:47.5263443Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:47.5263757Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:47.5264125Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:47.5264438Z 2025-05-07T20:32:47.5264635Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:47.5264831Z 2025-05-07T20:32:47.5264933Z moe/activation_test.py:126: 2025-05-07T20:32:47.5265231Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.5265569Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:47.5265894Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:47.5266840Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:47.5267640Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:47.5268209Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.5269001Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.5269720Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:47.5270478Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:47.5271241Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:47.5271912Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:47.5272543Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:47.5273150Z fn() 2025-05-07T20:32:47.5273676Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:47.5274294Z self.fn.run( 2025-05-07T20:32:47.5274786Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.5275344Z kernel = self.compile( 2025-05-07T20:32:47.5275910Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.5276598Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.5277006Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.5277241Z 2025-05-07T20:32:47.5277451Z self = 2025-05-07T20:32:47.5278585Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, 
reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.5280011Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1279963420>} 2025-05-07T20:32:47.5281426Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.5282514Z context = 2025-05-07T20:32:47.5282849Z 2025-05-07T20:32:47.5283030Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.5283571Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.5284064Z module_map=module_map) 2025-05-07T20:32:47.5284433Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.5284804Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:47.5285088Z E ^ 2025-05-07T20:32:47.5285573Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.5286050Z 2025-05-07T20:32:47.5286491Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.5287038Z 2025-05-07T20:32:47.5287145Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.5287589Z self=, 2025-05-07T20:32:47.5288018Z T=1, 2025-05-07T20:32:47.5288208Z D=5120, 2025-05-07T20:32:47.5288417Z scale_ub=1200.0, 2025-05-07T20:32:47.5288653Z contiguous=True, 2025-05-07T20:32:47.5288881Z compiled=True, 2025-05-07T20:32:47.5289092Z ) 2025-05-07T20:32:47.6646964Z self = 2025-05-07T20:32:47.6647959Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:47.6648460Z 2025-05-07T20:32:47.6648614Z @given( 2025-05-07T20:32:47.6649234Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.6649814Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.6650377Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.6659161Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.6659509Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.6659798Z ) 2025-05-07T20:32:47.6660162Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.6660629Z def test_silu_mul_quant( 2025-05-07T20:32:47.6660880Z self, 2025-05-07T20:32:47.6661083Z T: int, 2025-05-07T20:32:47.6661285Z D: int, 2025-05-07T20:32:47.6661512Z scale_ub: Optional[float], 2025-05-07T20:32:47.6661801Z contiguous: bool, 2025-05-07T20:32:47.6662052Z compiled: bool, 2025-05-07T20:32:47.6662285Z ) -> None: 2025-05-07T20:32:47.6662502Z torch.manual_seed(2025) 2025-05-07T20:32:47.6662754Z 2025-05-07T20:32:47.6663047Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.6663401Z 2025-05-07T20:32:47.6663608Z x_sign = torch.sign(x) 2025-05-07T20:32:47.6663909Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.6664224Z x = x_sign * x_clamp 2025-05-07T20:32:47.6664470Z x0 = x[:, :D] 2025-05-07T20:32:47.6664692Z x1 = x[:, D:] 2025-05-07T20:32:47.6664901Z 2025-05-07T20:32:47.6665096Z if contiguous: 2025-05-07T20:32:47.6665332Z x0 = x0.contiguous() 2025-05-07T20:32:47.6665585Z x1 = x1.contiguous() 2025-05-07T20:32:47.6665832Z 2025-05-07T20:32:47.6666023Z if scale_ub is not None: 2025-05-07T20:32:47.6666296Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.6666639Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.6666957Z ) 2025-05-07T20:32:47.6667147Z else: 2025-05-07T20:32:47.6667377Z scale_ub_tensor = 
None 2025-05-07T20:32:47.6667646Z 2025-05-07T20:32:47.6667881Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.6668198Z op = silu_mul_quant 2025-05-07T20:32:47.6668450Z if compiled: 2025-05-07T20:32:47.6668705Z op = torch.compile(op) 2025-05-07T20:32:47.6669007Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.6669291Z 2025-05-07T20:32:47.6669482Z > y_fp8, y_scale = fn() 2025-05-07T20:32:47.6669649Z 2025-05-07T20:32:47.6669748Z moe/activation_test.py:117: 2025-05-07T20:32:47.6670054Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.6670402Z moe/activation_test.py:115: in fn 2025-05-07T20:32:47.6670683Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.6671271Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:47.6671859Z return fn(*args, **kwargs) 2025-05-07T20:32:47.6672554Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:47.6673275Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:47.6673837Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.6674554Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.6675255Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.6675810Z kernel = self.compile( 2025-05-07T20:32:47.6676487Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.6677190Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.6677602Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.6677924Z 2025-05-07T20:32:47.6678135Z self = 2025-05-07T20:32:47.6679268Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.6680703Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1279470180>} 2025-05-07T20:32:47.6682120Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.6683241Z context = 2025-05-07T20:32:47.6683566Z 2025-05-07T20:32:47.6683742Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.6684285Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.6684773Z module_map=module_map) 2025-05-07T20:32:47.6685142Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.6685507Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:47.6685776Z E ^ 2025-05-07T20:32:47.6686253Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.6686742Z 2025-05-07T20:32:47.6687188Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.6687738Z 2025-05-07T20:32:47.6687840Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.6688268Z self=, 2025-05-07T20:32:47.6688690Z T=1, 2025-05-07T20:32:47.6688880Z D=5120, 2025-05-07T20:32:47.6689081Z scale_ub=None, 2025-05-07T20:32:47.6689295Z contiguous=False, 2025-05-07T20:32:47.6689528Z compiled=True, 2025-05-07T20:32:47.6689733Z ) 2025-05-07T20:32:47.7282476Z self = 2025-05-07T20:32:47.7283314Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:47.7283618Z 2025-05-07T20:32:47.7283699Z @given( 2025-05-07T20:32:47.7283934Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.7284252Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.7284557Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.7284899Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.7285235Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.7285526Z ) 2025-05-07T20:32:47.7285883Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.7286349Z def test_silu_mul_quant( 2025-05-07T20:32:47.7286588Z self, 2025-05-07T20:32:47.7286780Z T: int, 2025-05-07T20:32:47.7286975Z D: int, 2025-05-07T20:32:47.7287186Z scale_ub: Optional[float], 2025-05-07T20:32:47.7287464Z contiguous: bool, 2025-05-07T20:32:47.7287704Z compiled: bool, 2025-05-07T20:32:47.7287922Z ) -> None: 2025-05-07T20:32:47.7288146Z torch.manual_seed(2025) 2025-05-07T20:32:47.7288389Z 2025-05-07T20:32:47.7288660Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.7289024Z 2025-05-07T20:32:47.7289216Z x_sign = torch.sign(x) 2025-05-07T20:32:47.7289672Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.7289989Z x = x_sign * x_clamp 2025-05-07T20:32:47.7290226Z x0 = x[:, :D] 2025-05-07T20:32:47.7290439Z x1 = x[:, D:] 2025-05-07T20:32:47.7290646Z 2025-05-07T20:32:47.7290943Z if contiguous: 2025-05-07T20:32:47.7291180Z x0 = x0.contiguous() 2025-05-07T20:32:47.7291435Z x1 = x1.contiguous() 2025-05-07T20:32:47.7291678Z 2025-05-07T20:32:47.7291872Z if scale_ub is not None: 2025-05-07T20:32:47.7292144Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.7292485Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.7292804Z ) 2025-05-07T20:32:47.7292992Z else: 2025-05-07T20:32:47.7293205Z scale_ub_tensor = None 2025-05-07T20:32:47.7293462Z 2025-05-07T20:32:47.7293689Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.7294012Z op = silu_mul_quant 2025-05-07T20:32:47.7294271Z if compiled: 2025-05-07T20:32:47.7294626Z op = torch.compile(op) 2025-05-07T20:32:47.7294922Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.7295211Z 2025-05-07T20:32:47.7295401Z y_fp8, y_scale = fn() 2025-05-07T20:32:47.7295688Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:47.7295987Z 2025-05-07T20:32:47.7296226Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.7296560Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:47.7296864Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:47.7297183Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:47.7297541Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:47.7297858Z 2025-05-07T20:32:47.7298056Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:47.7298255Z 2025-05-07T20:32:47.7298356Z moe/activation_test.py:126: 2025-05-07T20:32:47.7298654Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.7299008Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:47.7299348Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:47.7300172Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:47.7300960Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:47.7301531Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.7302244Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.7302964Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:47.7303774Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:47.7304549Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:47.7305221Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:47.7305850Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:47.7306401Z fn() 2025-05-07T20:32:47.7306935Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:47.7307541Z self.fn.run( 2025-05-07T20:32:47.7308030Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.7308582Z kernel = self.compile( 2025-05-07T20:32:47.7309137Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.7309911Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.7310331Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.7310576Z 2025-05-07T20:32:47.7310796Z self = 2025-05-07T20:32:47.7311999Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.7313482Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1279473240>} 2025-05-07T20:32:47.7314896Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.7315993Z context = 2025-05-07T20:32:47.7316299Z 2025-05-07T20:32:47.7316475Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.7317011Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.7317504Z module_map=module_map) 2025-05-07T20:32:47.7317886Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.7318252Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:47.7318529Z E ^ 2025-05-07T20:32:47.7319013Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:47.7319490Z 
2025-05-07T20:32:47.7319933Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:47.7320478Z 
2025-05-07T20:32:47.7320582Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:47.7321022Z self=,
2025-05-07T20:32:47.7321446Z T=1,
2025-05-07T20:32:47.7321623Z D=5120,
2025-05-07T20:32:47.7321830Z scale_ub=None,
2025-05-07T20:32:47.7322051Z contiguous=True,
2025-05-07T20:32:47.7322283Z compiled=False,
2025-05-07T20:32:47.7322485Z )
2025-05-07T20:32:47.8832901Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:47.8833267Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:47.8833536Z E ^
2025-05-07T20:32:47.8834137Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:47.8834618Z 
2025-05-07T20:32:47.8835060Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:47.8835608Z 
2025-05-07T20:32:47.8835824Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:47.8836258Z self=,
2025-05-07T20:32:47.8836674Z T=128,
2025-05-07T20:32:47.8836867Z D=5120,
2025-05-07T20:32:47.8837065Z scale_ub=None,
2025-05-07T20:32:47.8837282Z contiguous=False,
2025-05-07T20:32:47.8837512Z compiled=True,
2025-05-07T20:32:47.8837721Z )
2025-05-07T20:32:47.8866053Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:47.8866420Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:47.8866688Z E ^
2025-05-07T20:32:47.8867168Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:47.8867648Z 
2025-05-07T20:32:47.8868089Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:47.8868637Z 
2025-05-07T20:32:47.8868744Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:47.8869174Z self=,
2025-05-07T20:32:47.8869588Z T=128,
2025-05-07T20:32:47.8869788Z D=7168,
2025-05-07T20:32:47.8869988Z scale_ub=1200.0,
2025-05-07T20:32:47.8870215Z contiguous=False,
2025-05-07T20:32:47.8870448Z compiled=False,
2025-05-07T20:32:47.8870656Z )
2025-05-07T20:32:48.0011527Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:48.0011889Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:48.0012156Z E ^
2025-05-07T20:32:48.0012640Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:48.0013110Z 
2025-05-07T20:32:48.0013562Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
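Note on the failures above and below: every Hypothesis example aborts at the same point, and the sampled parameters (T, D, scale_ub, contiguous, compiled) play no role. Both Triton kernels involved, FBGEMM's _fbgemm_silu_mul_quant (reached through silu_mul_quant in gen_ai/moe/activation.py) and the reference _kernel_quantize_fp8_row (reached through triton_quantize_fp8_row in triton_gemm/fp8_gemm.py), fail at compile time because Triton cannot lower the fp8e4nv dtype on this GPU. fp8e4nv corresponds to torch.float8_e4m3fn and requires NVIDIA compute capability 8.9 or newer (Ada/Hopper); the A10G in a linux.g5.4xlarge runner reports capability 8.6, where only fp8e4b15 and fp8e5 are available, exactly as the ValueError states.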
2025-05-07T20:32:48.0014217Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:48.0014738Z self=,
2025-05-07T20:32:48.0015167Z T=128,
2025-05-07T20:32:48.0015363Z D=5120,
2025-05-07T20:32:48.0015558Z scale_ub=None,
2025-05-07T20:32:48.0015777Z contiguous=False,
2025-05-07T20:32:48.0016007Z compiled=False,
2025-05-07T20:32:48.0016217Z )
2025-05-07T20:32:48.0043341Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:48.0043719Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:48.0043982Z E ^
2025-05-07T20:32:48.0044450Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:48.0044928Z 
2025-05-07T20:32:48.0045366Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:48.0045904Z 
2025-05-07T20:32:48.0046012Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:48.0046435Z self=,
2025-05-07T20:32:48.0046841Z T=128,
2025-05-07T20:32:48.0047029Z D=5120,
2025-05-07T20:32:48.0047226Z scale_ub=1200.0,
2025-05-07T20:32:48.0047440Z contiguous=True,
2025-05-07T20:32:48.0047656Z compiled=False,
2025-05-07T20:32:48.0047862Z )
2025-05-07T20:32:48.4034726Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:48.4035089Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:48.4035354Z E ^
2025-05-07T20:32:48.4035832Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:48.4036302Z 
2025-05-07T20:32:48.4036742Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
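Given that failure mode, a capability check is the usual way to keep this job green on pre-SM89 runners. The following is a minimal sketch only, assuming a unittest-style test class; the helper name supports_fp8e4nv and the skip placement are illustrative, not part of FBGEMM's actual test utilities:

import unittest

import torch


def supports_fp8e4nv() -> bool:
    # fp8e4nv (torch.float8_e4m3fn) only lowers on NVIDIA GPUs with
    # compute capability >= 8.9 (Ada/Hopper); the A10G here reports (8, 6).
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)


class SiluMulQuantTest(unittest.TestCase):
    @unittest.skipUnless(supports_fp8e4nv(), "Triton fp8e4nv requires SM 8.9+")
    def test_silu_mul_quant(self) -> None:
        ...  # property-based body as in moe/activation_test.py

With such a guard the runner would report one explicit skip per test instead of recompiling and failing the same kernels for every sampled example, as the log continues to do below.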
2025-05-07T20:32:48.4037386Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:48.4037811Z self=,
2025-05-07T20:32:48.4038230Z T=1,
2025-05-07T20:32:48.4038417Z D=7168,
2025-05-07T20:32:48.4038612Z scale_ub=1200.0,
2025-05-07T20:32:48.4038836Z contiguous=True,
2025-05-07T20:32:48.4039055Z compiled=True,
2025-05-07T20:32:48.4039261Z )
2025-05-07T20:32:48.4066996Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:48.4067346Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:48.4067601Z E ^
2025-05-07T20:32:48.4068156Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:48.4068627Z 
2025-05-07T20:32:48.4069060Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:48.4069685Z 
2025-05-07T20:32:48.4069789Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:48.4070212Z self=,
2025-05-07T20:32:48.4070635Z T=1,
2025-05-07T20:32:48.4070818Z D=7168,
2025-05-07T20:32:48.4071018Z scale_ub=1200.0,
2025-05-07T20:32:48.4071250Z contiguous=False,
2025-05-07T20:32:48.4071475Z compiled=True,
2025-05-07T20:32:48.4071683Z )
2025-05-07T20:32:48.5456540Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:48.5456896Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:48.5457159Z E ^
2025-05-07T20:32:48.5457642Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:48.5458114Z 
2025-05-07T20:32:48.5458553Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:48.5459100Z 
2025-05-07T20:32:48.5459202Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:48.5459635Z self=,
2025-05-07T20:32:48.5460050Z T=1,
2025-05-07T20:32:48.5460236Z D=7168,
2025-05-07T20:32:48.5460430Z scale_ub=None,
2025-05-07T20:32:48.5466911Z contiguous=False,
2025-05-07T20:32:48.5467149Z compiled=True,
2025-05-07T20:32:48.5467372Z )
2025-05-07T20:32:48.6373587Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:48.6373963Z E def _kernel_quantize_fp8_row(
2025-05-07T20:32:48.6374234Z E ^
2025-05-07T20:32:48.6374834Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:48.6375308Z 
2025-05-07T20:32:48.6375753Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:48.6376298Z 
2025-05-07T20:32:48.6376409Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:48.6376835Z self=,
2025-05-07T20:32:48.6377263Z T=1,
2025-05-07T20:32:48.6377459Z D=5120,
2025-05-07T20:32:48.6377656Z scale_ub=1200.0,
2025-05-07T20:32:48.6377894Z contiguous=False,
2025-05-07T20:32:48.6378132Z compiled=True,
2025-05-07T20:32:48.6378343Z )
2025-05-07T20:32:48.7909591Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:48.7909968Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:48.7910240Z E ^
2025-05-07T20:32:48.7910722Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:48.7911199Z 
2025-05-07T20:32:48.7911641Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:48.7912183Z 
2025-05-07T20:32:48.7912301Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:48.7912728Z self=,
2025-05-07T20:32:48.7913138Z T=1,
2025-05-07T20:32:48.7913345Z D=5120,
2025-05-07T20:32:48.7913569Z scale_ub=1200.0,
2025-05-07T20:32:48.7913798Z contiguous=False,
2025-05-07T20:32:48.7914023Z compiled=False,
2025-05-07T20:32:48.7914236Z )
2025-05-07T20:32:48.7941377Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:48.7941741Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:48.7941998Z E ^
2025-05-07T20:32:48.7942470Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:48.7942948Z 2025-05-07T20:32:48.7943391Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:48.7943930Z 2025-05-07T20:32:48.7944037Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:48.7944453Z self=, 2025-05-07T20:32:48.7944869Z T=16384, 2025-05-07T20:32:48.7945060Z D=5120, 2025-05-07T20:32:48.7945243Z scale_ub=1200.0, 2025-05-07T20:32:48.7945461Z contiguous=False, 2025-05-07T20:32:48.7945681Z compiled=True, 2025-05-07T20:32:48.7945878Z ) 2025-05-07T20:32:48.8812338Z self = 2025-05-07T20:32:48.8813055Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:48.8813831Z 2025-05-07T20:32:48.8814003Z @given( 2025-05-07T20:32:48.8814860Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:48.8815479Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:48.8816088Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:48.8816742Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:48.8817384Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:48.8817952Z ) 2025-05-07T20:32:48.8818646Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:48.8819532Z def test_silu_mul_quant( 2025-05-07T20:32:48.8820001Z self, 2025-05-07T20:32:48.8820374Z T: int, 2025-05-07T20:32:48.8820740Z D: int, 2025-05-07T20:32:48.8821170Z scale_ub: Optional[float], 2025-05-07T20:32:48.8821711Z contiguous: bool, 2025-05-07T20:32:48.8822170Z compiled: bool, 2025-05-07T20:32:48.8822599Z ) -> None: 2025-05-07T20:32:48.8822840Z torch.manual_seed(2025) 2025-05-07T20:32:48.8823124Z 2025-05-07T20:32:48.8823401Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:48.8823752Z 2025-05-07T20:32:48.8823939Z x_sign = torch.sign(x) 2025-05-07T20:32:48.8824226Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:48.8824546Z x = x_sign * x_clamp 2025-05-07T20:32:48.8824782Z x0 = x[:, :D] 2025-05-07T20:32:48.8824990Z x1 = x[:, D:] 2025-05-07T20:32:48.8825197Z 2025-05-07T20:32:48.8825379Z if contiguous: 2025-05-07T20:32:48.8825771Z x0 = x0.contiguous() 2025-05-07T20:32:48.8826032Z x1 = x1.contiguous() 2025-05-07T20:32:48.8826279Z 2025-05-07T20:32:48.8826477Z if scale_ub is not None: 2025-05-07T20:32:48.8826756Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:48.8827092Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:48.8827399Z ) 2025-05-07T20:32:48.8827592Z else: 2025-05-07T20:32:48.8827809Z scale_ub_tensor = None 2025-05-07T20:32:48.8828061Z 2025-05-07T20:32:48.8828293Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:48.8828609Z op = silu_mul_quant 2025-05-07T20:32:48.8828851Z if compiled: 2025-05-07T20:32:48.8829098Z op = torch.compile(op) 2025-05-07T20:32:48.8829390Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:48.8829668Z 2025-05-07T20:32:48.8829848Z > y_fp8, y_scale = fn() 2025-05-07T20:32:48.8830016Z 2025-05-07T20:32:48.8830113Z moe/activation_test.py:117: 2025-05-07T20:32:48.8830409Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:48.8830745Z moe/activation_test.py:115: in fn 2025-05-07T20:32:48.8831023Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:48.8831609Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:48.8832193Z return fn(*args, **kwargs) 
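This failure does not depend on the sampled inputs: fp8e4nv is Triton's name for the NVIDIA FP8 E4M3 format (torch.float8_e4m3fn), and Triton's CUDA backend only lowers it on GPUs with compute capability 8.9 or newer (Ada/Hopper). On older architectures the backend exposes only fp8e4b15 and fp8e5, exactly the pair the ValueError lists, so the error is a property of the GPU the job landed on rather than of any particular (T, D, scale_ub, contiguous, compiled) combination. A minimal sketch of a capability guard a test like this could use to skip instead of fail on such GPUs (the marker name requires_fp8_e4m3 is illustrative, not part of the test file):

    import pytest
    import torch

    # FP8 E4M3 (Triton "fp8e4nv") only compiles on SM 8.9+ GPUs,
    # so skip these cases rather than fail on older architectures.
    requires_fp8_e4m3 = pytest.mark.skipif(
        not torch.cuda.is_available()
        or torch.cuda.get_device_capability() < (8, 9),
        reason="Triton fp8e4nv (FP8 E4M3) requires compute capability >= 8.9",
    )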
2025-05-07T20:32:48.8832875Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:48.8833599Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:48.8834161Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:48.8834867Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:48.8835559Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:48.8836112Z kernel = self.compile( 2025-05-07T20:32:48.8836790Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:48.8837481Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:48.8837895Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:48.8838249Z 2025-05-07T20:32:48.8838465Z self = 2025-05-07T20:32:48.8839586Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:48.8841007Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f12799ba8e0>} 2025-05-07T20:32:48.8842421Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:48.8843507Z context = 2025-05-07T20:32:48.8843809Z 2025-05-07T20:32:48.8843982Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:48.8844516Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:48.8845001Z module_map=module_map) 2025-05-07T20:32:48.8845372Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:48.8845734Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:48.8845995Z E ^ 2025-05-07T20:32:48.8846473Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:48.8846942Z 2025-05-07T20:32:48.8847396Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:48.8847938Z 2025-05-07T20:32:48.8848041Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:48.8848462Z self=, 2025-05-07T20:32:48.8848885Z T=2048, 2025-05-07T20:32:48.8849073Z D=7168, 2025-05-07T20:32:48.8849262Z scale_ub=1200.0, 2025-05-07T20:32:48.8849485Z contiguous=False, 2025-05-07T20:32:48.8849708Z compiled=True, 2025-05-07T20:32:48.8849909Z ) 2025-05-07T20:32:48.8850230Z self = 2025-05-07T20:32:48.8850739Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:48.8851024Z 2025-05-07T20:32:48.8851103Z @given( 2025-05-07T20:32:48.8851333Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:48.8851653Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:48.8851966Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:48.8852298Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:48.8852633Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:48.8852933Z ) 2025-05-07T20:32:48.8853315Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:48.8853775Z def test_silu_mul_quant( 2025-05-07T20:32:48.8854018Z self, 2025-05-07T20:32:48.8854208Z T: int, 2025-05-07T20:32:48.8854479Z D: int, 2025-05-07T20:32:48.8854699Z scale_ub: Optional[float], 2025-05-07T20:32:48.8854963Z contiguous: bool, 2025-05-07T20:32:48.8855201Z compiled: bool, 2025-05-07T20:32:48.8855421Z ) -> None: 2025-05-07T20:32:48.8855635Z torch.manual_seed(2025) 2025-05-07T20:32:48.8855877Z 2025-05-07T20:32:48.8856171Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:48.8856523Z 2025-05-07T20:32:48.8856835Z x_sign = torch.sign(x) 2025-05-07T20:32:48.8864160Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:48.8864494Z x = x_sign * x_clamp 2025-05-07T20:32:48.8864737Z x0 = x[:, :D] 2025-05-07T20:32:48.8864965Z x1 = x[:, D:] 2025-05-07T20:32:48.8865302Z 2025-05-07T20:32:48.8865510Z if contiguous: 2025-05-07T20:32:48.8865753Z x0 = x0.contiguous() 2025-05-07T20:32:48.8866027Z x1 = x1.contiguous() 2025-05-07T20:32:48.8866286Z 2025-05-07T20:32:48.8866485Z if scale_ub is not None: 2025-05-07T20:32:48.8866773Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:48.8867123Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:48.8867437Z ) 2025-05-07T20:32:48.8867638Z else: 2025-05-07T20:32:48.8867848Z scale_ub_tensor = None 2025-05-07T20:32:48.8868100Z 2025-05-07T20:32:48.8868335Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:48.8868663Z op = silu_mul_quant 2025-05-07T20:32:48.8868917Z if compiled: 2025-05-07T20:32:48.8869166Z op = torch.compile(op) 2025-05-07T20:32:48.8869464Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:48.8869745Z 2025-05-07T20:32:48.8869942Z > y_fp8, y_scale = fn() 2025-05-07T20:32:48.8870113Z 2025-05-07T20:32:48.8870213Z moe/activation_test.py:117: 2025-05-07T20:32:48.8870521Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:48.8870859Z moe/activation_test.py:115: in fn 2025-05-07T20:32:48.8871148Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:48.8871734Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:48.8872320Z return fn(*args, **kwargs) 
2025-05-07T20:32:48.8873007Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:48.8873740Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:48.8874301Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:48.8875007Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:48.8875714Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:48.8876274Z kernel = self.compile( 2025-05-07T20:32:48.8876832Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:48.8877525Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:48.8877939Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:48.8878181Z 2025-05-07T20:32:48.8878395Z self = 2025-05-07T20:32:48.8879520Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:48.8880955Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1279602840>} 2025-05-07T20:32:48.8882367Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:48.8883509Z context = 2025-05-07T20:32:48.8883811Z 2025-05-07T20:32:48.8883985Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:48.8884597Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:48.8885090Z module_map=module_map) 2025-05-07T20:32:48.8885475Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:48.8885845Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:48.8886212Z E ^ 2025-05-07T20:32:48.8886705Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:48.8887179Z 2025-05-07T20:32:48.8887622Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:48.8888165Z 2025-05-07T20:32:49.0026524Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:49.0026985Z self=, 2025-05-07T20:32:49.0027509Z T=1, 2025-05-07T20:32:49.0027782Z D=5120, 2025-05-07T20:32:49.0028026Z scale_ub=None, 2025-05-07T20:32:49.0028241Z contiguous=False, 2025-05-07T20:32:49.0028478Z compiled=False, 2025-05-07T20:32:49.0028684Z ) 2025-05-07T20:32:49.0029003Z self = 2025-05-07T20:32:49.0029508Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:49.0029794Z 2025-05-07T20:32:49.0029872Z @given( 2025-05-07T20:32:49.0030099Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:49.0030413Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:49.0030720Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:49.0031056Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:49.0031380Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:49.0031669Z ) 2025-05-07T20:32:49.0032019Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:49.0032473Z def test_silu_mul_quant( 2025-05-07T20:32:49.0032711Z self, 2025-05-07T20:32:49.0032905Z T: int, 2025-05-07T20:32:49.0033107Z D: int, 2025-05-07T20:32:49.0033317Z scale_ub: Optional[float], 2025-05-07T20:32:49.0033591Z contiguous: bool, 2025-05-07T20:32:49.0033830Z compiled: bool, 2025-05-07T20:32:49.0034045Z ) -> None: 2025-05-07T20:32:49.0034261Z torch.manual_seed(2025) 2025-05-07T20:32:49.0034502Z 2025-05-07T20:32:49.0034772Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:49.0035126Z 2025-05-07T20:32:49.0035321Z x_sign = torch.sign(x) 2025-05-07T20:32:49.0035603Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:49.0035922Z x = x_sign * x_clamp 2025-05-07T20:32:49.0036165Z x0 = x[:, :D] 2025-05-07T20:32:49.0036372Z x1 = x[:, D:] 2025-05-07T20:32:49.0036578Z 2025-05-07T20:32:49.0036764Z if contiguous: 2025-05-07T20:32:49.0036989Z x0 = x0.contiguous() 2025-05-07T20:32:49.0037250Z x1 = x1.contiguous() 2025-05-07T20:32:49.0037502Z 2025-05-07T20:32:49.0037686Z if scale_ub is not None: 2025-05-07T20:32:49.0037959Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:49.0038299Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:49.0038622Z ) 2025-05-07T20:32:49.0038810Z else: 2025-05-07T20:32:49.0039019Z scale_ub_tensor = None 2025-05-07T20:32:49.0039270Z 2025-05-07T20:32:49.0039492Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:49.0039820Z op = silu_mul_quant 2025-05-07T20:32:49.0040073Z if compiled: 2025-05-07T20:32:49.0040317Z op = torch.compile(op) 2025-05-07T20:32:49.0040618Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.0040900Z 2025-05-07T20:32:49.0041086Z > y_fp8, y_scale = fn() 2025-05-07T20:32:49.0041260Z 2025-05-07T20:32:49.0041361Z moe/activation_test.py:117: 2025-05-07T20:32:49.0041856Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.0042200Z moe/activation_test.py:115: in fn 2025-05-07T20:32:49.0042478Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.0043197Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:49.0044040Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:49.0044595Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:49.0045311Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:49.0046001Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:49.0046557Z kernel = self.compile( 2025-05-07T20:32:49.0047116Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:49.0047802Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:49.0048210Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.0048447Z 2025-05-07T20:32:49.0048655Z self = 2025-05-07T20:32:49.0049784Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:49.0051211Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f127a0362a0>} 2025-05-07T20:32:49.0052618Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:49.0053750Z context = 2025-05-07T20:32:49.0054046Z 2025-05-07T20:32:49.0054214Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:49.0054899Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:49.0055381Z module_map=module_map) 2025-05-07T20:32:49.0055745Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:49.0056102Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:49.0056361Z E ^ 2025-05-07T20:32:49.0056840Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:49.0057311Z 2025-05-07T20:32:49.0057747Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:49.0058295Z 2025-05-07T20:32:49.0058405Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:49.0058829Z self=, 2025-05-07T20:32:49.0059241Z T=4096, 2025-05-07T20:32:49.0059428Z D=7168, 2025-05-07T20:32:49.0059619Z scale_ub=1200.0, 2025-05-07T20:32:49.0059843Z contiguous=False, 2025-05-07T20:32:49.0060058Z compiled=False, 2025-05-07T20:32:49.0060264Z ) 2025-05-07T20:32:49.0060588Z self = 2025-05-07T20:32:49.0061101Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:49.0061398Z 2025-05-07T20:32:49.0061476Z @given( 2025-05-07T20:32:49.0061708Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:49.0062018Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:49.0062325Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:49.0062661Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:49.0063113Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:49.0063419Z ) 2025-05-07T20:32:49.0063769Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:49.0064226Z def test_silu_mul_quant( 2025-05-07T20:32:49.0064540Z self, 2025-05-07T20:32:49.0064747Z T: int, 2025-05-07T20:32:49.0064949Z D: int, 2025-05-07T20:32:49.0065164Z scale_ub: Optional[float], 2025-05-07T20:32:49.0065449Z contiguous: bool, 2025-05-07T20:32:49.0065693Z compiled: bool, 2025-05-07T20:32:49.0065912Z ) -> None: 2025-05-07T20:32:49.0066128Z torch.manual_seed(2025) 2025-05-07T20:32:49.0066373Z 2025-05-07T20:32:49.0066640Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:49.0066999Z 2025-05-07T20:32:49.0067189Z x_sign = torch.sign(x) 2025-05-07T20:32:49.0067482Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:49.0067798Z x = x_sign * x_clamp 2025-05-07T20:32:49.0068037Z x0 = x[:, :D] 2025-05-07T20:32:49.0068255Z x1 = x[:, D:] 2025-05-07T20:32:49.0068456Z 2025-05-07T20:32:49.0068636Z if contiguous: 2025-05-07T20:32:49.0068865Z x0 = x0.contiguous() 2025-05-07T20:32:49.0069127Z x1 = x1.contiguous() 2025-05-07T20:32:49.0069373Z 2025-05-07T20:32:49.0069567Z if scale_ub is not None: 2025-05-07T20:32:49.0069835Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:49.0070169Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:49.0070485Z ) 2025-05-07T20:32:49.0070683Z else: 2025-05-07T20:32:49.0070896Z scale_ub_tensor = None 2025-05-07T20:32:49.0071151Z 2025-05-07T20:32:49.0071373Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:49.0071687Z op = silu_mul_quant 2025-05-07T20:32:49.0071941Z if compiled: 2025-05-07T20:32:49.0072190Z op = torch.compile(op) 2025-05-07T20:32:49.0072482Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.0072759Z 2025-05-07T20:32:49.0072945Z > y_fp8, y_scale = fn() 2025-05-07T20:32:49.0073108Z 2025-05-07T20:32:49.0073204Z moe/activation_test.py:117: 2025-05-07T20:32:49.0073505Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.0073843Z moe/activation_test.py:115: in fn 2025-05-07T20:32:49.0074119Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.0074833Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:49.0075556Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:49.0076112Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:49.0076823Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:49.0077523Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:49.0078084Z kernel = self.compile( 2025-05-07T20:32:49.0078641Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:49.0079332Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:49.0079740Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.0079978Z 2025-05-07T20:32:49.0080191Z self = 2025-05-07T20:32:49.0081306Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:49.0082849Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f127a0ef100>} 2025-05-07T20:32:49.0084306Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:49.0085468Z context = 2025-05-07T20:32:49.0085769Z 2025-05-07T20:32:49.0085944Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:49.0086482Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:49.0086968Z module_map=module_map) 2025-05-07T20:32:49.0087344Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:49.0087707Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:49.0087972Z E ^ 2025-05-07T20:32:49.0088459Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:49.0088932Z 2025-05-07T20:32:49.0089377Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:49.0089929Z 2025-05-07T20:32:49.0090034Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:49.0090458Z self=, 2025-05-07T20:32:49.0090874Z T=16384, 2025-05-07T20:32:49.0091070Z D=7168, 2025-05-07T20:32:49.0091259Z scale_ub=None, 2025-05-07T20:32:49.0091474Z contiguous=True, 2025-05-07T20:32:49.0091697Z compiled=True, 2025-05-07T20:32:49.0091895Z ) 2025-05-07T20:32:49.1832265Z self = 2025-05-07T20:32:49.1832838Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:49.1833236Z 2025-05-07T20:32:49.1833350Z @given( 2025-05-07T20:32:49.1833679Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:49.1834008Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:49.1834318Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:49.1834661Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:49.1834997Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:49.1835282Z ) 2025-05-07T20:32:49.1835635Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:49.1836094Z def test_silu_mul_quant( 2025-05-07T20:32:49.1836344Z self, 2025-05-07T20:32:49.1836542Z T: int, 2025-05-07T20:32:49.1836744Z D: int, 2025-05-07T20:32:49.1836960Z scale_ub: Optional[float], 2025-05-07T20:32:49.1837247Z contiguous: bool, 2025-05-07T20:32:49.1837495Z compiled: bool, 2025-05-07T20:32:49.1837723Z ) -> None: 2025-05-07T20:32:49.1837936Z torch.manual_seed(2025) 2025-05-07T20:32:49.1838180Z 2025-05-07T20:32:49.1838456Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:49.1838800Z 2025-05-07T20:32:49.1838995Z x_sign = torch.sign(x) 2025-05-07T20:32:49.1839283Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:49.1839594Z x = x_sign * x_clamp 2025-05-07T20:32:49.1839826Z x0 = x[:, :D] 2025-05-07T20:32:49.1840037Z x1 = x[:, D:] 2025-05-07T20:32:49.1840238Z 2025-05-07T20:32:49.1840421Z if contiguous: 2025-05-07T20:32:49.1840649Z x0 = x0.contiguous() 2025-05-07T20:32:49.1840903Z x1 = x1.contiguous() 2025-05-07T20:32:49.1841141Z 2025-05-07T20:32:49.1841325Z if scale_ub is not None: 2025-05-07T20:32:49.1841587Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:49.1841923Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:49.1842239Z ) 2025-05-07T20:32:49.1842429Z else: 2025-05-07T20:32:49.1842797Z scale_ub_tensor = None 2025-05-07T20:32:49.1843062Z 2025-05-07T20:32:49.1843290Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:49.1843602Z op = silu_mul_quant 2025-05-07T20:32:49.1843964Z if compiled: 2025-05-07T20:32:49.1844210Z op = torch.compile(op) 2025-05-07T20:32:49.1844501Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.1844778Z 2025-05-07T20:32:49.1844972Z > y_fp8, y_scale = fn() 2025-05-07T20:32:49.1845137Z 2025-05-07T20:32:49.1845234Z moe/activation_test.py:117: 2025-05-07T20:32:49.1845530Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.1845867Z moe/activation_test.py:115: in fn 2025-05-07T20:32:49.1846145Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.1846720Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:49.1847310Z return fn(*args, **kwargs) 
2025-05-07T20:32:49.1847992Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:49.1848707Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:49.1849263Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:49.1849972Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:49.1850661Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:49.1851214Z kernel = self.compile( 2025-05-07T20:32:49.1851770Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:49.1852457Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:49.1852868Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.1853111Z 2025-05-07T20:32:49.1853316Z self = 2025-05-07T20:32:49.1854533Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:49.1855958Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f127a6a5300>} 2025-05-07T20:32:49.1857364Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:49.1858434Z context = 2025-05-07T20:32:49.1858734Z 2025-05-07T20:32:49.1858906Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:49.1859437Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:49.1859915Z module_map=module_map) 2025-05-07T20:32:49.1860284Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:49.1860640Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:49.1860893Z E ^ 2025-05-07T20:32:49.1861359Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:49.1861827Z 2025-05-07T20:32:49.1862261Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:49.1862803Z 2025-05-07T20:32:49.1862905Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:49.1863404Z self=, 2025-05-07T20:32:49.1863813Z T=4096, 2025-05-07T20:32:49.1863998Z D=5120, 2025-05-07T20:32:49.1864187Z scale_ub=None, 2025-05-07T20:32:49.1864393Z contiguous=False, 2025-05-07T20:32:49.1864640Z compiled=True, 2025-05-07T20:32:49.1864914Z ) 2025-05-07T20:32:49.1865236Z self = 2025-05-07T20:32:49.1865737Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:49.1866024Z 2025-05-07T20:32:49.1866101Z @given( 2025-05-07T20:32:49.1866324Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:49.1866635Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:49.1866942Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:49.1867269Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:49.1867599Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:49.1867881Z ) 2025-05-07T20:32:49.1868238Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:49.1868687Z def test_silu_mul_quant( 2025-05-07T20:32:49.1868920Z self, 2025-05-07T20:32:49.1869112Z T: int, 2025-05-07T20:32:49.1869308Z D: int, 2025-05-07T20:32:49.1869522Z scale_ub: Optional[float], 2025-05-07T20:32:49.1869793Z contiguous: bool, 2025-05-07T20:32:49.1870030Z compiled: bool, 2025-05-07T20:32:49.1870242Z ) -> None: 2025-05-07T20:32:49.1870450Z torch.manual_seed(2025) 2025-05-07T20:32:49.1870689Z 2025-05-07T20:32:49.1870956Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:49.1871312Z 2025-05-07T20:32:49.1871508Z x_sign = torch.sign(x) 2025-05-07T20:32:49.1871797Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:49.1872107Z x = x_sign * x_clamp 2025-05-07T20:32:49.1872356Z x0 = x[:, :D] 2025-05-07T20:32:49.1872571Z x1 = x[:, D:] 2025-05-07T20:32:49.1872782Z 2025-05-07T20:32:49.1872969Z if contiguous: 2025-05-07T20:32:49.1873204Z x0 = x0.contiguous() 2025-05-07T20:32:49.1873485Z x1 = x1.contiguous() 2025-05-07T20:32:49.1873758Z 2025-05-07T20:32:49.1873952Z if scale_ub is not None: 2025-05-07T20:32:49.1874231Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:49.1874575Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:49.1874895Z ) 2025-05-07T20:32:49.1875095Z else: 2025-05-07T20:32:49.1875311Z scale_ub_tensor = None 2025-05-07T20:32:49.1875573Z 2025-05-07T20:32:49.1875802Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:49.1876120Z op = silu_mul_quant 2025-05-07T20:32:49.1876370Z if compiled: 2025-05-07T20:32:49.1876617Z op = torch.compile(op) 2025-05-07T20:32:49.1876910Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.1877191Z 2025-05-07T20:32:49.1877385Z > y_fp8, y_scale = fn() 2025-05-07T20:32:49.1877548Z 2025-05-07T20:32:49.1877648Z moe/activation_test.py:117: 2025-05-07T20:32:49.1877948Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.1878297Z moe/activation_test.py:115: in fn 2025-05-07T20:32:49.1878575Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.1879156Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:49.1879743Z return fn(*args, **kwargs) 
2025-05-07T20:32:49.1880425Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:49.1881153Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:49.1881708Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:49.1882504Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:49.1883246Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:49.1883809Z kernel = self.compile( 2025-05-07T20:32:49.1884472Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:49.1885159Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:49.1885572Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.1885820Z 2025-05-07T20:32:49.1886031Z self = 2025-05-07T20:32:49.1893566Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:49.1895094Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f127ac44360>} 2025-05-07T20:32:49.1896515Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:49.1897607Z context = 2025-05-07T20:32:49.1897911Z 2025-05-07T20:32:49.1898087Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:49.1898632Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:49.1899122Z module_map=module_map) 2025-05-07T20:32:49.1899490Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:49.1899852Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:49.1900128Z E ^ 2025-05-07T20:32:49.1900606Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:49.1901078Z 2025-05-07T20:32:49.1901515Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:49.1902067Z 2025-05-07T20:32:49.3350147Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:49.3350599Z self=, 2025-05-07T20:32:49.3351176Z T=4096, 2025-05-07T20:32:49.3351445Z D=5120, 2025-05-07T20:32:49.3351648Z scale_ub=1200.0, 2025-05-07T20:32:49.3351872Z contiguous=False, 2025-05-07T20:32:49.3352098Z compiled=False, 2025-05-07T20:32:49.3352311Z ) 2025-05-07T20:32:49.3352626Z self = 2025-05-07T20:32:49.3353138Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:49.3353435Z 2025-05-07T20:32:49.3353517Z @given( 2025-05-07T20:32:49.3353741Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:49.3354061Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:49.3354375Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:49.3354716Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:49.3355050Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:49.3355344Z ) 2025-05-07T20:32:49.3355701Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:49.3356151Z def test_silu_mul_quant( 2025-05-07T20:32:49.3356399Z self, 2025-05-07T20:32:49.3356596Z T: int, 2025-05-07T20:32:49.3356792Z D: int, 2025-05-07T20:32:49.3357014Z scale_ub: Optional[float], 2025-05-07T20:32:49.3357290Z contiguous: bool, 2025-05-07T20:32:49.3357530Z compiled: bool, 2025-05-07T20:32:49.3357755Z ) -> None: 2025-05-07T20:32:49.3358148Z torch.manual_seed(2025) 2025-05-07T20:32:49.3358400Z 2025-05-07T20:32:49.3358677Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:49.3359039Z 2025-05-07T20:32:49.3359241Z x_sign = torch.sign(x) 2025-05-07T20:32:49.3359648Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:49.3359973Z x = x_sign * x_clamp 2025-05-07T20:32:49.3360218Z x0 = x[:, :D] 2025-05-07T20:32:49.3360436Z x1 = x[:, D:] 2025-05-07T20:32:49.3360647Z 2025-05-07T20:32:49.3360836Z if contiguous: 2025-05-07T20:32:49.3361068Z x0 = x0.contiguous() 2025-05-07T20:32:49.3361342Z x1 = x1.contiguous() 2025-05-07T20:32:49.3361595Z 2025-05-07T20:32:49.3361791Z if scale_ub is not None: 2025-05-07T20:32:49.3362074Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:49.3362415Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:49.3362733Z ) 2025-05-07T20:32:49.3362931Z else: 2025-05-07T20:32:49.3363147Z scale_ub_tensor = None 2025-05-07T20:32:49.3363415Z 2025-05-07T20:32:49.3363696Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:49.3364031Z op = silu_mul_quant 2025-05-07T20:32:49.3364292Z if compiled: 2025-05-07T20:32:49.3364546Z op = torch.compile(op) 2025-05-07T20:32:49.3364852Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.3365140Z 2025-05-07T20:32:49.3365334Z > y_fp8, y_scale = fn() 2025-05-07T20:32:49.3365504Z 2025-05-07T20:32:49.3365604Z moe/activation_test.py:117: 2025-05-07T20:32:49.3365906Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.3366241Z moe/activation_test.py:115: in fn 2025-05-07T20:32:49.3366532Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.3367265Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:49.3367999Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:49.3368557Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:49.3369280Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:49.3369978Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:49.3370535Z kernel = self.compile( 2025-05-07T20:32:49.3371105Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:49.3371795Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:49.3372208Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.3372444Z 2025-05-07T20:32:49.3372661Z self = 2025-05-07T20:32:49.3373836Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:49.3375374Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f127ac46700>} 2025-05-07T20:32:49.3376791Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:49.3377879Z context = 2025-05-07T20:32:49.3378179Z 2025-05-07T20:32:49.3378349Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:49.3378978Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:49.3379466Z module_map=module_map) 2025-05-07T20:32:49.3379841Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:49.3380282Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:49.3380553Z E ^ 2025-05-07T20:32:49.3381032Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:49.3381508Z 2025-05-07T20:32:49.3381946Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:49.3382492Z 2025-05-07T20:32:49.3382597Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:49.3383025Z self=, 2025-05-07T20:32:49.3383463Z T=4096, 2025-05-07T20:32:49.3383686Z D=5120, 2025-05-07T20:32:49.3383888Z scale_ub=1200.0, 2025-05-07T20:32:49.3384114Z contiguous=False, 2025-05-07T20:32:49.3384345Z compiled=True, 2025-05-07T20:32:49.3384556Z ) 2025-05-07T20:32:49.3384886Z self = 2025-05-07T20:32:49.3385402Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:49.3385691Z 2025-05-07T20:32:49.3385771Z @given( 2025-05-07T20:32:49.3386005Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:49.3386324Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:49.3386649Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:49.3386986Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:49.3387321Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:49.3387611Z ) 2025-05-07T20:32:49.3387969Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:49.3388421Z def test_silu_mul_quant( 2025-05-07T20:32:49.3388666Z self, 2025-05-07T20:32:49.3388866Z T: int, 2025-05-07T20:32:49.3389059Z D: int, 2025-05-07T20:32:49.3389274Z scale_ub: Optional[float], 2025-05-07T20:32:49.3389551Z contiguous: bool, 2025-05-07T20:32:49.3389795Z compiled: bool, 2025-05-07T20:32:49.3390016Z ) -> None: 2025-05-07T20:32:49.3390232Z torch.manual_seed(2025) 2025-05-07T20:32:49.3390477Z 2025-05-07T20:32:49.3390746Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:49.3391104Z 2025-05-07T20:32:49.3391297Z x_sign = torch.sign(x) 2025-05-07T20:32:49.3391584Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:49.3391899Z x = x_sign * x_clamp 2025-05-07T20:32:49.3392136Z x0 = x[:, :D] 2025-05-07T20:32:49.3392347Z x1 = x[:, D:] 2025-05-07T20:32:49.3392556Z 2025-05-07T20:32:49.3392743Z if contiguous: 2025-05-07T20:32:49.3392974Z x0 = x0.contiguous() 2025-05-07T20:32:49.3393252Z x1 = x1.contiguous() 2025-05-07T20:32:49.3393533Z 2025-05-07T20:32:49.3393750Z if scale_ub is not None: 2025-05-07T20:32:49.3394024Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:49.3394371Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:49.3394689Z ) 2025-05-07T20:32:49.3394879Z else: 2025-05-07T20:32:49.3395089Z scale_ub_tensor = None 2025-05-07T20:32:49.3395344Z 2025-05-07T20:32:49.3395571Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:49.3395893Z op = silu_mul_quant 2025-05-07T20:32:49.3396149Z if compiled: 2025-05-07T20:32:49.3396393Z op = torch.compile(op) 2025-05-07T20:32:49.3396691Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.3396970Z 2025-05-07T20:32:49.3397160Z > y_fp8, y_scale = fn() 2025-05-07T20:32:49.3397327Z 2025-05-07T20:32:49.3397510Z moe/activation_test.py:117: 2025-05-07T20:32:49.3397811Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.3398153Z moe/activation_test.py:115: in fn 2025-05-07T20:32:49.3398434Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.3399085Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:49.3399673Z return fn(*args, **kwargs) 
2025-05-07T20:32:49.3400354Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:49.3401073Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:49.3401634Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:49.3402346Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:49.3403041Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:49.3403651Z kernel = self.compile( 2025-05-07T20:32:49.3404211Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:49.3404899Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:49.3405311Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.3405553Z 2025-05-07T20:32:49.3405763Z self = 2025-05-07T20:32:49.3406881Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:49.3408310Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f127b37b2e0>} 2025-05-07T20:32:49.3409710Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:49.3410796Z context = 2025-05-07T20:32:49.3411100Z 2025-05-07T20:32:49.3411271Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:49.3411805Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:49.3412283Z module_map=module_map) 2025-05-07T20:32:49.3412661Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:49.3413023Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:49.3413287Z E ^ 2025-05-07T20:32:49.3413817Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:49.3414289Z 2025-05-07T20:32:49.3414813Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:49.3415359Z 2025-05-07T20:32:49.4547452Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:49.4547933Z self=, 2025-05-07T20:32:49.4548543Z T=2048, 2025-05-07T20:32:49.4548800Z D=7168, 2025-05-07T20:32:49.4549045Z scale_ub=1200.0, 2025-05-07T20:32:49.4549325Z contiguous=False, 2025-05-07T20:32:49.4549615Z compiled=False, 2025-05-07T20:32:49.4549865Z ) 2025-05-07T20:32:49.4550194Z self = 2025-05-07T20:32:49.4550718Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:49.4551014Z 2025-05-07T20:32:49.4551098Z @given( 2025-05-07T20:32:49.4551509Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:49.4551834Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:49.4552150Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:49.4552485Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:49.4552962Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:49.4553306Z ) 2025-05-07T20:32:49.4553663Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:49.4554117Z def test_silu_mul_quant( 2025-05-07T20:32:49.4554364Z self, 2025-05-07T20:32:49.4554560Z T: int, 2025-05-07T20:32:49.4554756Z D: int, 2025-05-07T20:32:49.4554979Z scale_ub: Optional[float], 2025-05-07T20:32:49.4555259Z contiguous: bool, 2025-05-07T20:32:49.4555504Z compiled: bool, 2025-05-07T20:32:49.4555724Z ) -> None: 2025-05-07T20:32:49.4555951Z torch.manual_seed(2025) 2025-05-07T20:32:49.4556197Z 2025-05-07T20:32:49.4556477Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:49.4556831Z 2025-05-07T20:32:49.4557029Z x_sign = torch.sign(x) 2025-05-07T20:32:49.4557319Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:49.4557643Z x = x_sign * x_clamp 2025-05-07T20:32:49.4557886Z x0 = x[:, :D] 2025-05-07T20:32:49.4558106Z x1 = x[:, D:] 2025-05-07T20:32:49.4558315Z 2025-05-07T20:32:49.4558506Z if contiguous: 2025-05-07T20:32:49.4558737Z x0 = x0.contiguous() 2025-05-07T20:32:49.4559000Z x1 = x1.contiguous() 2025-05-07T20:32:49.4559242Z 2025-05-07T20:32:49.4559430Z if scale_ub is not None: 2025-05-07T20:32:49.4559705Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:49.4560041Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:49.4560351Z ) 2025-05-07T20:32:49.4560552Z else: 2025-05-07T20:32:49.4560773Z scale_ub_tensor = None 2025-05-07T20:32:49.4561038Z 2025-05-07T20:32:49.4561272Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:49.4561592Z op = silu_mul_quant 2025-05-07T20:32:49.4561845Z if compiled: 2025-05-07T20:32:49.4562126Z op = torch.compile(op) 2025-05-07T20:32:49.4562419Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.4562701Z 2025-05-07T20:32:49.4562899Z > y_fp8, y_scale = fn() 2025-05-07T20:32:49.4563088Z 2025-05-07T20:32:49.4563199Z moe/activation_test.py:117: 2025-05-07T20:32:49.4563518Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.4563865Z moe/activation_test.py:115: in fn 2025-05-07T20:32:49.4564146Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.4564864Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:49.4565598Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:49.4566156Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:49.4566869Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:49.4567571Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:49.4568131Z kernel = self.compile( 2025-05-07T20:32:49.4568691Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:49.4569378Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:49.4569781Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.4570016Z 2025-05-07T20:32:49.4570231Z self = 2025-05-07T20:32:49.4571432Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:49.4572867Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f129592a340>} 2025-05-07T20:32:49.4574352Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:49.4575609Z context = 2025-05-07T20:32:49.4575911Z 2025-05-07T20:32:49.4576085Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:49.4576626Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:49.4577120Z module_map=module_map) 2025-05-07T20:32:49.4577503Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:49.4577872Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:49.4578141Z E ^ 2025-05-07T20:32:49.4578633Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:49.4579110Z 2025-05-07T20:32:49.4579552Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:49.4580095Z 2025-05-07T20:32:49.4580201Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:49.4580627Z self=, 2025-05-07T20:32:49.4581048Z T=1, 2025-05-07T20:32:49.4581241Z D=7168, 2025-05-07T20:32:49.4581437Z scale_ub=None, 2025-05-07T20:32:49.4581659Z contiguous=True, 2025-05-07T20:32:49.4581889Z compiled=False, 2025-05-07T20:32:49.4582101Z ) 2025-05-07T20:32:49.4582431Z self = 2025-05-07T20:32:49.4582939Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:49.4583210Z 2025-05-07T20:32:49.4583297Z @given( 2025-05-07T20:32:49.4583530Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:49.4583856Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:49.4584175Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:49.4584514Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:49.4584857Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:49.4585157Z ) 2025-05-07T20:32:49.4585512Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:49.4585975Z def test_silu_mul_quant( 2025-05-07T20:32:49.4586226Z self, 2025-05-07T20:32:49.4586427Z T: int, 2025-05-07T20:32:49.4586633Z D: int, 2025-05-07T20:32:49.4586859Z scale_ub: Optional[float], 2025-05-07T20:32:49.4587131Z contiguous: bool, 2025-05-07T20:32:49.4587372Z compiled: bool, 2025-05-07T20:32:49.4587593Z ) -> None: 2025-05-07T20:32:49.4587804Z torch.manual_seed(2025) 2025-05-07T20:32:49.4588054Z 2025-05-07T20:32:49.4588332Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:49.4588680Z 2025-05-07T20:32:49.4588883Z x_sign = torch.sign(x) 2025-05-07T20:32:49.4589177Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:49.4589490Z x = x_sign * x_clamp 2025-05-07T20:32:49.4589731Z x0 = x[:, :D] 2025-05-07T20:32:49.4589952Z x1 = x[:, D:] 2025-05-07T20:32:49.4590166Z 2025-05-07T20:32:49.4590345Z if contiguous: 2025-05-07T20:32:49.4590575Z x0 = x0.contiguous() 2025-05-07T20:32:49.4590838Z x1 = x1.contiguous() 2025-05-07T20:32:49.4591080Z 2025-05-07T20:32:49.4591357Z if scale_ub is not None: 2025-05-07T20:32:49.4591637Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:49.4591967Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:49.4592279Z ) 2025-05-07T20:32:49.4592550Z else: 2025-05-07T20:32:49.4592757Z scale_ub_tensor = None 2025-05-07T20:32:49.4593011Z 2025-05-07T20:32:49.4593240Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:49.4593579Z op = silu_mul_quant 2025-05-07T20:32:49.4593858Z if compiled: 2025-05-07T20:32:49.4594106Z op = torch.compile(op) 2025-05-07T20:32:49.4594403Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.4594684Z 2025-05-07T20:32:49.4594876Z > y_fp8, y_scale = fn() 2025-05-07T20:32:49.4595043Z 2025-05-07T20:32:49.4595143Z moe/activation_test.py:117: 2025-05-07T20:32:49.4595439Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.4595784Z moe/activation_test.py:115: in fn 2025-05-07T20:32:49.4596072Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:49.4596786Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:49.4597523Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:49.4598079Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:49.4598792Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:49.4599484Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:49.4600042Z kernel = self.compile( 2025-05-07T20:32:49.4600606Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:49.4601293Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:49.4601704Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:49.4601944Z 2025-05-07T20:32:49.4602153Z self = 2025-05-07T20:32:49.4603303Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:49.4604743Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1295a031a0>} 2025-05-07T20:32:49.4606155Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:49.4607244Z context = 2025-05-07T20:32:49.4607543Z 2025-05-07T20:32:49.4607717Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:49.4608256Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:49.4608741Z module_map=module_map) 2025-05-07T20:32:49.4609110Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:49.4609471Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:49.4609731Z E ^ 2025-05-07T20:32:49.4610208Z E ValueError("type fp8e4nv not supported in this architecture. 
Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=True,
)

self =
T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': }
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
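Since the CompilationError is purely an architecture mismatch, one way to keep this suite green on pre-Ada runners would be to skip the FP8 path up front instead of letting kernel compilation fail mid-test. A minimal sketch, assuming the (8, 9) capability threshold holds for the installed Triton; supports_fp8e4nv and Fp8SkipExample are illustrative names, not FBGEMM code:

    import unittest

    import torch


    def supports_fp8e4nv() -> bool:
        # Hypothetical guard: fp8e4nv (FP8 E4M3) is assumed to need
        # compute capability >= 8.9 (e.g. L4, H100).
        return (
            torch.cuda.is_available()
            and torch.cuda.get_device_capability() >= (8, 9)
        )


    class Fp8SkipExample(unittest.TestCase):
        @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")
        def test_silu_mul_quant_fp8(self) -> None:
            # The FP8 test body would run here; on SM < 8.9 the test is
            # reported as skipped rather than failing as above.
            pass


    if __name__ == "__main__":
        unittest.main()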
Hypothesis goes on to retry the test with the remaining examples. Every one of them fails at the same point, while Triton compiles _fbgemm_silu_mul_quant, with the identical CompilationError raised from triton/compiler/compiler.py:100, so only the sampled parameters differ:

Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=None,   contiguous=False, compiled=False) -> CompilationError
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=None,   contiguous=False, compiled=True)  -> CompilationError
Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=None,   contiguous=False, compiled=True)  -> CompilationError
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> CompilationError
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True,  compiled=True)  -> CompilationError
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None,   contiguous=False, compiled=True)  -> CompilationError
Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=None,   contiguous=False, compiled=True)  -> CompilationError
Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True)  -> CompilationError
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=True,  compiled=True)  -> CompilationError
Trying example: test_silu_mul_quant(T=128,   D=5120, scale_ub=1200.0, contiguous=False, compiled=True)  -> CompilationError
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True,  compiled=True)  -> CompilationError
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:50.7645532Z 2025-05-07T20:32:50.7646215Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:50.7647059Z 2025-05-07T20:32:50.8862825Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.8863699Z self=, 2025-05-07T20:32:50.8864423Z T=16384, 2025-05-07T20:32:50.8864753Z D=5120, 2025-05-07T20:32:50.8865083Z scale_ub=1200.0, 2025-05-07T20:32:50.8865464Z contiguous=True, 2025-05-07T20:32:50.8865844Z compiled=False, 2025-05-07T20:32:50.8866195Z ) 2025-05-07T20:32:50.8866741Z self = 2025-05-07T20:32:50.8867637Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:50.8868150Z 2025-05-07T20:32:50.8868277Z @given( 2025-05-07T20:32:50.8868667Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.8869207Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.8869741Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.8870325Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.8870908Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.8871417Z ) 2025-05-07T20:32:50.8872037Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.8872827Z def test_silu_mul_quant( 2025-05-07T20:32:50.8873254Z self, 2025-05-07T20:32:50.8873632Z T: int, 2025-05-07T20:32:50.8873955Z D: int, 2025-05-07T20:32:50.8874304Z scale_ub: Optional[float], 2025-05-07T20:32:50.8874746Z contiguous: bool, 2025-05-07T20:32:50.8875139Z compiled: bool, 2025-05-07T20:32:50.8875494Z ) -> None: 2025-05-07T20:32:50.8875850Z torch.manual_seed(2025) 2025-05-07T20:32:50.8876243Z 2025-05-07T20:32:50.8876679Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.8877268Z 2025-05-07T20:32:50.8877597Z x_sign = torch.sign(x) 2025-05-07T20:32:50.8878088Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:50.8878635Z x = x_sign * x_clamp 2025-05-07T20:32:50.8879469Z x0 = x[:, :D] 2025-05-07T20:32:50.8879837Z x1 = x[:, D:] 2025-05-07T20:32:50.8880198Z 2025-05-07T20:32:50.8880515Z if contiguous: 2025-05-07T20:32:50.8880900Z x0 = x0.contiguous() 2025-05-07T20:32:50.8881589Z x1 = x1.contiguous() 2025-05-07T20:32:50.8882001Z 2025-05-07T20:32:50.8882330Z if scale_ub is not None: 2025-05-07T20:32:50.8882800Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:50.8883369Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:50.8883913Z ) 2025-05-07T20:32:50.8884246Z else: 2025-05-07T20:32:50.8884594Z scale_ub_tensor = None 2025-05-07T20:32:50.8885036Z 2025-05-07T20:32:50.8885430Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:50.8885978Z op = silu_mul_quant 2025-05-07T20:32:50.8886411Z if compiled: 2025-05-07T20:32:50.8886837Z op = torch.compile(op) 2025-05-07T20:32:50.8887371Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.8887848Z 2025-05-07T20:32:50.8888173Z > y_fp8, y_scale = fn() 2025-05-07T20:32:50.8888460Z 2025-05-07T20:32:50.8888638Z moe/activation_test.py:117: 2025-05-07T20:32:50.8889159Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.8889749Z moe/activation_test.py:115: in fn 2025-05-07T20:32:50.8890241Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.8891504Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:50.8892788Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:50.8893825Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:50.8895225Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:50.8896366Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:50.8897120Z kernel = self.compile( 2025-05-07T20:32:50.8897870Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:50.8898809Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:50.8899358Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.8899685Z 2025-05-07T20:32:50.8899964Z self = 2025-05-07T20:32:50.8901536Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:50.8903648Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f127ad1c720>} 2025-05-07T20:32:50.8905693Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:50.8907238Z context = 2025-05-07T20:32:50.8907674Z 2025-05-07T20:32:50.8907904Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:50.8908702Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:50.8909441Z module_map=module_map) 2025-05-07T20:32:50.8910008Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:50.8910559Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:50.8910955Z E ^ 2025-05-07T20:32:50.8911842Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:50.8912593Z 2025-05-07T20:32:50.8913272Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:50.8914213Z 2025-05-07T20:32:50.8914381Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:50.8915024Z self=, 2025-05-07T20:32:50.8915656Z T=1, 2025-05-07T20:32:50.8915941Z D=7168, 2025-05-07T20:32:50.8916234Z scale_ub=1200.0, 2025-05-07T20:32:50.8916581Z contiguous=False, 2025-05-07T20:32:50.8916929Z compiled=False, 2025-05-07T20:32:50.8917235Z ) 2025-05-07T20:32:50.8917731Z self = 2025-05-07T20:32:50.8918508Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:50.8918933Z 2025-05-07T20:32:50.8919057Z @given( 2025-05-07T20:32:50.8919404Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:50.8919893Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:50.8920372Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:50.8920880Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:50.8921402Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:50.8921854Z ) 2025-05-07T20:32:50.8922394Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:50.8923100Z def test_silu_mul_quant( 2025-05-07T20:32:50.8923483Z self, 2025-05-07T20:32:50.8923781Z T: int, 2025-05-07T20:32:50.8924075Z D: int, 2025-05-07T20:32:50.8924409Z scale_ub: Optional[float], 2025-05-07T20:32:50.8924833Z contiguous: bool, 2025-05-07T20:32:50.8925172Z compiled: bool, 2025-05-07T20:32:50.8925739Z ) -> None: 2025-05-07T20:32:50.8926075Z torch.manual_seed(2025) 2025-05-07T20:32:50.8926426Z 2025-05-07T20:32:50.8926865Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:50.8927410Z 2025-05-07T20:32:50.8927710Z x_sign = torch.sign(x) 2025-05-07T20:32:50.8928179Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:50.8928679Z x = x_sign * x_clamp 2025-05-07T20:32:50.8929042Z x0 = x[:, :D] 2025-05-07T20:32:50.8929397Z x1 = x[:, D:] 2025-05-07T20:32:50.8929717Z 2025-05-07T20:32:50.8929988Z if contiguous: 2025-05-07T20:32:50.8930358Z x0 = x0.contiguous() 2025-05-07T20:32:50.8930802Z x1 = x1.contiguous() 2025-05-07T20:32:50.8931212Z 2025-05-07T20:32:50.8931543Z if scale_ub is not None: 2025-05-07T20:32:50.8932024Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:50.8932610Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:50.8933145Z ) 2025-05-07T20:32:50.8933474Z else: 2025-05-07T20:32:50.8933850Z scale_ub_tensor = None 2025-05-07T20:32:50.8934283Z 2025-05-07T20:32:50.8934755Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:50.8935324Z op = silu_mul_quant 2025-05-07T20:32:50.8935756Z if compiled: 2025-05-07T20:32:50.8936195Z op = torch.compile(op) 2025-05-07T20:32:50.8936711Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.8937186Z 2025-05-07T20:32:50.8937512Z > y_fp8, y_scale = fn() 2025-05-07T20:32:50.8937800Z 2025-05-07T20:32:50.8937984Z moe/activation_test.py:117: 2025-05-07T20:32:50.8938492Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.8939083Z moe/activation_test.py:115: in fn 2025-05-07T20:32:50.8939568Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:50.8940839Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:50.8942367Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:50.8943376Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:50.8944678Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:50.8946054Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:50.8947031Z kernel = self.compile( 2025-05-07T20:32:50.8948010Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:50.8961172Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:50.8961928Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:50.8962350Z 2025-05-07T20:32:50.8962717Z self = 2025-05-07T20:32:50.8964718Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:50.8967276Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f127b2f05e0>} 2025-05-07T20:32:50.8969797Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:50.8971703Z context = 2025-05-07T20:32:50.8972227Z 2025-05-07T20:32:50.8972516Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:50.8973465Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:50.8974320Z module_map=module_map) 2025-05-07T20:32:50.8975050Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:50.8975659Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:50.8976123Z E ^ 2025-05-07T20:32:50.8976945Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:50.8977783Z 2025-05-07T20:32:50.8978550Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:50.8979524Z 2025-05-07T20:32:51.0715373Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.0716192Z self=, 2025-05-07T20:32:51.0716877Z T=4096, 2025-05-07T20:32:51.0717189Z D=7168, 2025-05-07T20:32:51.0717501Z scale_ub=1200.0, 2025-05-07T20:32:51.0717869Z contiguous=False, 2025-05-07T20:32:51.0718280Z compiled=True, 2025-05-07T20:32:51.0718608Z ) 2025-05-07T20:32:51.0719131Z self = 2025-05-07T20:32:51.0719952Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:51.0720398Z 2025-05-07T20:32:51.0720526Z @given( 2025-05-07T20:32:51.0720876Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.0721394Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.0721898Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.0722436Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.0722976Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.0723457Z ) 2025-05-07T20:32:51.0724044Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.0724796Z def test_silu_mul_quant( 2025-05-07T20:32:51.0725199Z self, 2025-05-07T20:32:51.0725792Z T: int, 2025-05-07T20:32:51.0727071Z D: int, 2025-05-07T20:32:51.0727448Z scale_ub: Optional[float], 2025-05-07T20:32:51.0727889Z contiguous: bool, 2025-05-07T20:32:51.0728284Z compiled: bool, 2025-05-07T20:32:51.0728650Z ) -> None: 2025-05-07T20:32:51.0729242Z torch.manual_seed(2025) 2025-05-07T20:32:51.0729634Z 2025-05-07T20:32:51.0730068Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.0730639Z 2025-05-07T20:32:51.0730945Z x_sign = torch.sign(x) 2025-05-07T20:32:51.0731418Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.0731930Z x = x_sign * x_clamp 2025-05-07T20:32:51.0732313Z x0 = x[:, :D] 2025-05-07T20:32:51.0732664Z x1 = x[:, D:] 2025-05-07T20:32:51.0733003Z 2025-05-07T20:32:51.0733292Z if contiguous: 2025-05-07T20:32:51.0733651Z x0 = x0.contiguous() 2025-05-07T20:32:51.0734075Z x1 = x1.contiguous() 2025-05-07T20:32:51.0734645Z 2025-05-07T20:32:51.0734966Z if scale_ub is not None: 2025-05-07T20:32:51.0735424Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.0735967Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.0736437Z ) 2025-05-07T20:32:51.0736701Z else: 2025-05-07T20:32:51.0736973Z scale_ub_tensor = None 2025-05-07T20:32:51.0737331Z 2025-05-07T20:32:51.0737652Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.0738095Z op = silu_mul_quant 2025-05-07T20:32:51.0738444Z if compiled: 2025-05-07T20:32:51.0738798Z op = torch.compile(op) 2025-05-07T20:32:51.0739235Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.0739639Z 2025-05-07T20:32:51.0739935Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.0740191Z 2025-05-07T20:32:51.0740355Z moe/activation_test.py:117: 2025-05-07T20:32:51.0740822Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.0741327Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.0741766Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.0742640Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:51.0743619Z return fn(*args, **kwargs) 
2025-05-07T20:32:51.0744698Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.0745867Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.0746771Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.0748005Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.0749230Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.0750199Z kernel = self.compile( 2025-05-07T20:32:51.0751171Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.0752354Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.0753042Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.0753413Z 2025-05-07T20:32:51.0753780Z self = 2025-05-07T20:32:51.0755609Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.0757926Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f127b0aafc0>} 2025-05-07T20:32:51.0760396Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.0762167Z context = 2025-05-07T20:32:51.0762775Z 2025-05-07T20:32:51.0763068Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.0763986Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.0764740Z module_map=module_map) 2025-05-07T20:32:51.0765268Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.0765848Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.0766304Z E ^ 2025-05-07T20:32:51.0767151Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.0767986Z 2025-05-07T20:32:51.0768760Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.0769735Z 2025-05-07T20:32:51.0769914Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.0770650Z self=, 2025-05-07T20:32:51.0771384Z T=128, 2025-05-07T20:32:51.0771688Z D=7168, 2025-05-07T20:32:51.0772021Z scale_ub=1200.0, 2025-05-07T20:32:51.0772404Z contiguous=False, 2025-05-07T20:32:51.0772780Z compiled=True, 2025-05-07T20:32:51.0773124Z ) 2025-05-07T20:32:51.1696198Z self = 2025-05-07T20:32:51.1697171Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:51.1697673Z 2025-05-07T20:32:51.1697811Z @given( 2025-05-07T20:32:51.1698194Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.1698752Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.1699312Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.1699904Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.1700480Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.1700994Z ) 2025-05-07T20:32:51.1701610Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.1702393Z def test_silu_mul_quant( 2025-05-07T20:32:51.1702811Z self, 2025-05-07T20:32:51.1703135Z T: int, 2025-05-07T20:32:51.1703477Z D: int, 2025-05-07T20:32:51.1703873Z scale_ub: Optional[float], 2025-05-07T20:32:51.1704347Z contiguous: bool, 2025-05-07T20:32:51.1704748Z compiled: bool, 2025-05-07T20:32:51.1705133Z ) -> None: 2025-05-07T20:32:51.1705498Z torch.manual_seed(2025) 2025-05-07T20:32:51.1705910Z 2025-05-07T20:32:51.1706378Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.1706986Z 2025-05-07T20:32:51.1707306Z x_sign = torch.sign(x) 2025-05-07T20:32:51.1707782Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.1708291Z x = x_sign * x_clamp 2025-05-07T20:32:51.1708685Z x0 = x[:, :D] 2025-05-07T20:32:51.1709030Z x1 = x[:, D:] 2025-05-07T20:32:51.1709368Z 2025-05-07T20:32:51.1709665Z if contiguous: 2025-05-07T20:32:51.1710027Z x0 = x0.contiguous() 2025-05-07T20:32:51.1710456Z x1 = x1.contiguous() 2025-05-07T20:32:51.1710871Z 2025-05-07T20:32:51.1711185Z if scale_ub is not None: 2025-05-07T20:32:51.1711652Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.1712228Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.1712752Z ) 2025-05-07T20:32:51.1713078Z else: 2025-05-07T20:32:51.1713426Z scale_ub_tensor = None 2025-05-07T20:32:51.1713854Z 2025-05-07T20:32:51.1714658Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.1715210Z op = silu_mul_quant 2025-05-07T20:32:51.1715634Z if compiled: 2025-05-07T20:32:51.1716059Z op = torch.compile(op) 2025-05-07T20:32:51.1716563Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.1717269Z 2025-05-07T20:32:51.1717587Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.1717881Z 2025-05-07T20:32:51.1718053Z moe/activation_test.py:117: 2025-05-07T20:32:51.1718568Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.1719147Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.1719639Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.1720661Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:51.1721679Z return fn(*args, **kwargs) 
2025-05-07T20:32:51.1722899Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.1724178Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.1725156Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.1726765Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.1727988Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.1728968Z kernel = self.compile( 2025-05-07T20:32:51.1729933Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.1730859Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.1731417Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.1731737Z 2025-05-07T20:32:51.1732048Z self = 2025-05-07T20:32:51.1733562Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.1735750Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f129675d800>} 2025-05-07T20:32:51.1737791Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.1739334Z context = 2025-05-07T20:32:51.1739778Z 2025-05-07T20:32:51.1740060Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.1740866Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.1741606Z module_map=module_map) 2025-05-07T20:32:51.1742158Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.1742628Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.1742993Z E ^ 2025-05-07T20:32:51.1743692Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.1744351Z 2025-05-07T20:32:51.1744976Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.1745740Z 2025-05-07T20:32:51.1745887Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.1746487Z self=, 2025-05-07T20:32:51.1747080Z T=2048, 2025-05-07T20:32:51.1747336Z D=7168, 2025-05-07T20:32:51.1747607Z scale_ub=None, 2025-05-07T20:32:51.1748109Z contiguous=True, 2025-05-07T20:32:51.1748425Z compiled=True, 2025-05-07T20:32:51.1748716Z ) 2025-05-07T20:32:51.1749173Z self = 2025-05-07T20:32:51.1749879Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:51.1750436Z 2025-05-07T20:32:51.1750544Z @given( 2025-05-07T20:32:51.1750865Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.1751314Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.1751746Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.1752221Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.1752694Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.1753098Z ) 2025-05-07T20:32:51.1753621Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.1754296Z def test_silu_mul_quant( 2025-05-07T20:32:51.1754645Z self, 2025-05-07T20:32:51.1754919Z T: int, 2025-05-07T20:32:51.1755208Z D: int, 2025-05-07T20:32:51.1755516Z scale_ub: Optional[float], 2025-05-07T20:32:51.1755894Z contiguous: bool, 2025-05-07T20:32:51.1756239Z compiled: bool, 2025-05-07T20:32:51.1756564Z ) -> None: 2025-05-07T20:32:51.1756857Z torch.manual_seed(2025) 2025-05-07T20:32:51.1757205Z 2025-05-07T20:32:51.1757587Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.1758073Z 2025-05-07T20:32:51.1758347Z x_sign = torch.sign(x) 2025-05-07T20:32:51.1758756Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.1759191Z x = x_sign * x_clamp 2025-05-07T20:32:51.1759533Z x0 = x[:, :D] 2025-05-07T20:32:51.1759842Z x1 = x[:, D:] 2025-05-07T20:32:51.1760127Z 2025-05-07T20:32:51.1760403Z if contiguous: 2025-05-07T20:32:51.1760736Z x0 = x0.contiguous() 2025-05-07T20:32:51.1761106Z x1 = x1.contiguous() 2025-05-07T20:32:51.1761468Z 2025-05-07T20:32:51.1761740Z if scale_ub is not None: 2025-05-07T20:32:51.1762118Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.1762589Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.1763038Z ) 2025-05-07T20:32:51.1763306Z else: 2025-05-07T20:32:51.1763591Z scale_ub_tensor = None 2025-05-07T20:32:51.1763947Z 2025-05-07T20:32:51.1764271Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.1764708Z op = silu_mul_quant 2025-05-07T20:32:51.1765069Z if compiled: 2025-05-07T20:32:51.1765422Z op = torch.compile(op) 2025-05-07T20:32:51.1765835Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.1766231Z 2025-05-07T20:32:51.1766501Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.1766733Z 2025-05-07T20:32:51.1766869Z moe/activation_test.py:117: 2025-05-07T20:32:51.1767299Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.1767772Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.1768167Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.1768977Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:51.1769812Z return fn(*args, **kwargs) 
2025-05-07T20:32:51.1770784Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.1771805Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.1772597Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.1773651Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.1774848Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.1775638Z kernel = self.compile( 2025-05-07T20:32:51.1776429Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.1777396Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.1778056Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.1778404Z 2025-05-07T20:32:51.1778689Z self = 2025-05-07T20:32:51.1780268Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.1782292Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1278524400>} 2025-05-07T20:32:51.1784282Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.1785752Z context = 2025-05-07T20:32:51.1786190Z 2025-05-07T20:32:51.1786424Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.1787190Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.1787872Z module_map=module_map) 2025-05-07T20:32:51.1788390Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.1788907Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.1789294Z E ^ 2025-05-07T20:32:51.1789995Z E ValueError("type fp8e4nv not supported in this architecture. 
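The CompilationError repeated above is an architecture mismatch rather than a kernel bug: Triton's fp8e4nv type corresponds to float8_e4m3fn, which Triton only compiles for GPUs of compute capability 8.9 or newer, and the dtype list it prints ('fp8e4b15', 'fp8e5') is what it offers on older parts. Below is a minimal sketch of a capability guard such a test could use to skip fp8e4nv cases on unsupported hardware; the helper name and decorator placement are illustrative assumptions, not code from the test file above:

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv (float8_e4m3fn) requires compute capability >= (8, 9)
        # (Ada/Hopper); older GPUs trigger the CompilationError above.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
    class SiluMulQuantFp8Test(unittest.TestCase):
        ...  # fp8-dependent cases such as test_silu_mul_quant would live here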
2025-05-07T20:32:51.2402863Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:51.2403637Z     self=,
2025-05-07T20:32:51.2404368Z     T=16384,
2025-05-07T20:32:51.2404674Z     D=5120,
2025-05-07T20:32:51.2404977Z     scale_ub=None,
2025-05-07T20:32:51.2405321Z     contiguous=False,
2025-05-07T20:32:51.2405673Z     compiled=False,
2025-05-07T20:32:51.2406003Z )
2025-05-07T20:32:51.2406514Z self = 
2025-05-07T20:32:51.2407349Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False
2025-05-07T20:32:51.2407847Z 
2025-05-07T20:32:51.2407975Z     @given(
2025-05-07T20:32:51.2408342Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:51.2408855Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:51.2409358Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:51.2409917Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:51.2410471Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:51.2410942Z     )
2025-05-07T20:32:51.2411496Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:51.2412244Z     def test_silu_mul_quant(
2025-05-07T20:32:51.2412625Z         self,
2025-05-07T20:32:51.2412940Z         T: int,
2025-05-07T20:32:51.2413261Z         D: int,
2025-05-07T20:32:51.2413604Z         scale_ub: Optional[float],
2025-05-07T20:32:51.2414062Z         contiguous: bool,
2025-05-07T20:32:51.2414634Z         compiled: bool,
2025-05-07T20:32:51.2415004Z     ) -> None:
2025-05-07T20:32:51.2415363Z         torch.manual_seed(2025)
2025-05-07T20:32:51.2415771Z 
2025-05-07T20:32:51.2416588Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:51.2417205Z 
2025-05-07T20:32:51.2417535Z         x_sign = torch.sign(x)
2025-05-07T20:32:51.2418023Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:51.2421759Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 40.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:51.2425364Z 
2025-05-07T20:32:51.2425935Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:32:51.2426293Z 
The remaining out-of-memory examples fail the same way; only the failing line and the requested allocation differ:
2025-05-07T20:32:51.2426455Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -- OutOfMemoryError at moe/activation_test.py:95 (x_clamp; tried to allocate 112.00 MiB)
2025-05-07T20:32:51.2449024Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False) -- OutOfMemoryError at moe/activation_test.py:92 (torch.randn; tried to allocate 448.00 MiB)
2025-05-07T20:32:51.2471168Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -- OutOfMemoryError at moe/activation_test.py:95 (x_clamp; tried to allocate 56.00 MiB)
2025-05-07T20:32:51.2504445Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False) -- OutOfMemoryError at moe/activation_test.py:94 (x_sign; tried to allocate 56.00 MiB)
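Note that these out-of-memory failures hit small allocations (56-448 MiB) while roughly 21.9-22.0 GiB of the 22.07 GiB device is already in use, which points at memory accumulating across hypothesis examples rather than any single example being too large on its own. The error text itself suggests PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True; here is a short sketch of that setting plus an explicit cache release between examples, offered as possible mitigations under stated assumptions, not as fixes applied in this run:

    import gc
    import os

    # The caching allocator reads this configuration when CUDA memory is
    # first allocated, so it must be set before any CUDA work in the process.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch

    def release_cuda_memory() -> None:
        # Drop dead Python references, then return cached allocator blocks
        # to the driver so the next example starts from a cleaner pool.
        gc.collect()
        torch.cuda.empty_cache()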
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.3647877Z 2025-05-07T20:32:51.3648088Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:51.3648433Z 2025-05-07T20:32:51.3648613Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.3649291Z self=, 2025-05-07T20:32:51.3649943Z T=1, 2025-05-07T20:32:51.3650192Z D=7168, 2025-05-07T20:32:51.3650444Z scale_ub=1200.0, 2025-05-07T20:32:51.3650757Z contiguous=True, 2025-05-07T20:32:51.3651083Z compiled=False, 2025-05-07T20:32:51.3651391Z ) 2025-05-07T20:32:51.3652367Z self = 2025-05-07T20:32:51.3653202Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:51.3653656Z 2025-05-07T20:32:51.3653778Z @given( 2025-05-07T20:32:51.3654138Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.3655000Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.3655505Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.3656007Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.3656541Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.3657019Z ) 2025-05-07T20:32:51.3657630Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.3658386Z def test_silu_mul_quant( 2025-05-07T20:32:51.3658805Z self, 2025-05-07T20:32:51.3659125Z T: int, 2025-05-07T20:32:51.3659455Z D: int, 2025-05-07T20:32:51.3659820Z scale_ub: Optional[float], 2025-05-07T20:32:51.3660279Z contiguous: bool, 2025-05-07T20:32:51.3660698Z compiled: bool, 2025-05-07T20:32:51.3661082Z ) -> None: 2025-05-07T20:32:51.3661441Z torch.manual_seed(2025) 2025-05-07T20:32:51.3661870Z 2025-05-07T20:32:51.3662333Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.3662923Z 2025-05-07T20:32:51.3663230Z x_sign = torch.sign(x) 2025-05-07T20:32:51.3663746Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.3664230Z x = x_sign * x_clamp 2025-05-07T20:32:51.3664631Z x0 = x[:, :D] 2025-05-07T20:32:51.3665000Z x1 = x[:, D:] 2025-05-07T20:32:51.3665336Z 2025-05-07T20:32:51.3665628Z if contiguous: 2025-05-07T20:32:51.3665995Z x0 = x0.contiguous() 2025-05-07T20:32:51.3666406Z x1 = x1.contiguous() 2025-05-07T20:32:51.3666783Z 2025-05-07T20:32:51.3667096Z if scale_ub is not None: 2025-05-07T20:32:51.3667541Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.3668057Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.3668544Z ) 2025-05-07T20:32:51.3668863Z else: 2025-05-07T20:32:51.3669197Z scale_ub_tensor = None 2025-05-07T20:32:51.3669624Z 2025-05-07T20:32:51.3670008Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.3670533Z op = silu_mul_quant 2025-05-07T20:32:51.3670936Z if compiled: 2025-05-07T20:32:51.3671338Z op = torch.compile(op) 2025-05-07T20:32:51.3671823Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.3672275Z 2025-05-07T20:32:51.3672594Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.3672851Z 2025-05-07T20:32:51.3673019Z moe/activation_test.py:117: 2025-05-07T20:32:51.3673494Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.3674078Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.3674569Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.3675657Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.3676824Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.3677816Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.3679073Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.3680263Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.3681246Z kernel = self.compile( 2025-05-07T20:32:51.3682237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.3683440Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.3684293Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.3684730Z 2025-05-07T20:32:51.3685089Z self = 2025-05-07T20:32:51.3687085Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.3689570Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f121ba10f40>} 2025-05-07T20:32:51.3691998Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.3693952Z context = 2025-05-07T20:32:51.3694605Z 2025-05-07T20:32:51.3694896Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.3695838Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.3696672Z module_map=module_map) 2025-05-07T20:32:51.3697309Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.3697933Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.3698386Z E ^ 2025-05-07T20:32:51.3699235Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.3700085Z 2025-05-07T20:32:51.3700857Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.3701820Z 2025-05-07T20:32:51.3702010Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.3702751Z self=, 2025-05-07T20:32:51.3703473Z T=128, 2025-05-07T20:32:51.3703797Z D=5120, 2025-05-07T20:32:51.3704118Z scale_ub=None, 2025-05-07T20:32:51.3704486Z contiguous=True, 2025-05-07T20:32:51.3704869Z compiled=False, 2025-05-07T20:32:51.3705226Z ) 2025-05-07T20:32:51.4384163Z self = 2025-05-07T20:32:51.4385072Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:51.4385530Z 2025-05-07T20:32:51.4385656Z @given( 2025-05-07T20:32:51.4386029Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.4386542Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.4387037Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.4387579Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.4388142Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.4388626Z ) 2025-05-07T20:32:51.4389240Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.4389985Z def test_silu_mul_quant( 2025-05-07T20:32:51.4390374Z self, 2025-05-07T20:32:51.4390694Z T: int, 2025-05-07T20:32:51.4391033Z D: int, 2025-05-07T20:32:51.4391395Z scale_ub: Optional[float], 2025-05-07T20:32:51.4391834Z contiguous: bool, 2025-05-07T20:32:51.4392210Z compiled: bool, 2025-05-07T20:32:51.4392530Z ) -> None: 2025-05-07T20:32:51.4392832Z torch.manual_seed(2025) 2025-05-07T20:32:51.4393200Z 2025-05-07T20:32:51.4393627Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.4394179Z 2025-05-07T20:32:51.4394492Z x_sign = torch.sign(x) 2025-05-07T20:32:51.4394956Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.4395454Z x = x_sign * x_clamp 2025-05-07T20:32:51.4395848Z x0 = x[:, :D] 2025-05-07T20:32:51.4396595Z x1 = x[:, D:] 2025-05-07T20:32:51.4396949Z 2025-05-07T20:32:51.4397243Z if contiguous: 2025-05-07T20:32:51.4397644Z x0 = x0.contiguous() 2025-05-07T20:32:51.4398080Z x1 = x1.contiguous() 2025-05-07T20:32:51.4398494Z 2025-05-07T20:32:51.4399074Z if scale_ub is not None: 2025-05-07T20:32:51.4399550Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.4400123Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.4400667Z ) 2025-05-07T20:32:51.4400989Z else: 2025-05-07T20:32:51.4401335Z scale_ub_tensor = None 2025-05-07T20:32:51.4401769Z 2025-05-07T20:32:51.4402160Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.4402699Z op = silu_mul_quant 2025-05-07T20:32:51.4403131Z if compiled: 2025-05-07T20:32:51.4403544Z op = torch.compile(op) 2025-05-07T20:32:51.4404079Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4404536Z 2025-05-07T20:32:51.4404843Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.4405106Z 2025-05-07T20:32:51.4405272Z moe/activation_test.py:117: 2025-05-07T20:32:51.4405781Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.4406345Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.4406811Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4407960Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.4409138Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.4410055Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.4411220Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.4412374Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.4413313Z kernel = self.compile( 2025-05-07T20:32:51.4414271Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.4415575Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.4416292Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.4416707Z 2025-05-07T20:32:51.4417072Z self = 2025-05-07T20:32:51.4419040Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.4421591Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f121ba12020>} 2025-05-07T20:32:51.4424155Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.4426434Z context = 2025-05-07T20:32:51.4426951Z 2025-05-07T20:32:51.4427230Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.4428101Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.4428901Z module_map=module_map) 2025-05-07T20:32:51.4429497Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.4430093Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.4430544Z E ^ 2025-05-07T20:32:51.4431380Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.4432424Z 2025-05-07T20:32:51.4433213Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.4434176Z 2025-05-07T20:32:51.4434355Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.4435248Z self=, 2025-05-07T20:32:51.4435964Z T=128, 2025-05-07T20:32:51.4436282Z D=7168, 2025-05-07T20:32:51.4436595Z scale_ub=None, 2025-05-07T20:32:51.4436953Z contiguous=True, 2025-05-07T20:32:51.4437331Z compiled=False, 2025-05-07T20:32:51.4437677Z ) 2025-05-07T20:32:51.4438236Z self = 2025-05-07T20:32:51.4439117Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:51.4439602Z 2025-05-07T20:32:51.4439733Z @given( 2025-05-07T20:32:51.4440121Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.4440682Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.4441228Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.4441798Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.4442380Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.4442892Z ) 2025-05-07T20:32:51.4443502Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.4444345Z def test_silu_mul_quant( 2025-05-07T20:32:51.4444760Z self, 2025-05-07T20:32:51.4445074Z T: int, 2025-05-07T20:32:51.4445402Z D: int, 2025-05-07T20:32:51.4445771Z scale_ub: Optional[float], 2025-05-07T20:32:51.4446230Z contiguous: bool, 2025-05-07T20:32:51.4446635Z compiled: bool, 2025-05-07T20:32:51.4447014Z ) -> None: 2025-05-07T20:32:51.4447363Z torch.manual_seed(2025) 2025-05-07T20:32:51.4447779Z 2025-05-07T20:32:51.4448250Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.4448853Z 2025-05-07T20:32:51.4449140Z x_sign = torch.sign(x) 2025-05-07T20:32:51.4449537Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.4449955Z x = x_sign * x_clamp 2025-05-07T20:32:51.4450292Z x0 = x[:, :D] 2025-05-07T20:32:51.4450596Z x1 = x[:, D:] 2025-05-07T20:32:51.4450903Z 2025-05-07T20:32:51.4451168Z if contiguous: 2025-05-07T20:32:51.4451503Z x0 = x0.contiguous() 2025-05-07T20:32:51.4451891Z x1 = x1.contiguous() 2025-05-07T20:32:51.4452213Z 2025-05-07T20:32:51.4452489Z if scale_ub is not None: 2025-05-07T20:32:51.4452895Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.4453400Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.4453956Z ) 2025-05-07T20:32:51.4454270Z else: 2025-05-07T20:32:51.4454756Z scale_ub_tensor = None 2025-05-07T20:32:51.4455155Z 2025-05-07T20:32:51.4455543Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.4456090Z op = silu_mul_quant 2025-05-07T20:32:51.4456494Z if compiled: 2025-05-07T20:32:51.4456885Z op = torch.compile(op) 2025-05-07T20:32:51.4457392Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4457851Z 2025-05-07T20:32:51.4458174Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.4458456Z 2025-05-07T20:32:51.4458620Z moe/activation_test.py:117: 2025-05-07T20:32:51.4459117Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.4459709Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.4460181Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.4461412Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.4462609Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.4463627Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.4464750Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.4465760Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.4466762Z kernel = self.compile( 2025-05-07T20:32:51.4467636Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.4468726Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.4469351Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.4469721Z 2025-05-07T20:32:51.4470022Z self = 2025-05-07T20:32:51.4471821Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.4474193Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f121ba12f20>} 2025-05-07T20:32:51.4476439Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.4478136Z context = 2025-05-07T20:32:51.4478611Z 2025-05-07T20:32:51.4478873Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.4479709Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.4480456Z module_map=module_map) 2025-05-07T20:32:51.4481037Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.4481604Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.4482008Z E ^ 2025-05-07T20:32:51.4482764Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.4483535Z 2025-05-07T20:32:51.4484297Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.4485254Z 2025-05-07T20:32:51.4485439Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.4486166Z self=, 2025-05-07T20:32:51.4486868Z T=2048, 2025-05-07T20:32:51.4487187Z D=7168, 2025-05-07T20:32:51.4487500Z scale_ub=1200.0, 2025-05-07T20:32:51.4487876Z contiguous=True, 2025-05-07T20:32:51.4488241Z compiled=False, 2025-05-07T20:32:51.4488588Z ) 2025-05-07T20:32:51.5299673Z self = 2025-05-07T20:32:51.5300584Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:51.5301049Z 2025-05-07T20:32:51.5301174Z @given( 2025-05-07T20:32:51.5301558Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.5302071Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.5302560Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.5303105Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.5303664Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.5304130Z ) 2025-05-07T20:32:51.5304718Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.5305461Z def test_silu_mul_quant( 2025-05-07T20:32:51.5305849Z self, 2025-05-07T20:32:51.5306150Z T: int, 2025-05-07T20:32:51.5306469Z D: int, 2025-05-07T20:32:51.5307288Z scale_ub: Optional[float], 2025-05-07T20:32:51.5307719Z contiguous: bool, 2025-05-07T20:32:51.5308046Z compiled: bool, 2025-05-07T20:32:51.5308364Z ) -> None: 2025-05-07T20:32:51.5308657Z torch.manual_seed(2025) 2025-05-07T20:32:51.5309010Z 2025-05-07T20:32:51.5309699Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.5313262Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.70 GiB is allocated by PyTorch, and 53.93 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
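Note on the CompilationError repeated above: Triton rejects the fp8e4nv (FP8 E4M3) element type while compiling _fbgemm_silu_mul_quant because the GPU in this job predates hardware FP8 support; fp8e4nv appears to require compute capability 8.9 or newer (Ada/Hopper), which is why only 'fp8e4b15' and 'fp8e5' are offered. A minimal sketch of a capability guard a test could use to skip these cases on older parts — the (8, 9) threshold, helper name, and skip message are assumptions, not code from this repository:

    import unittest
    import torch

    def supports_fp8_e4m3() -> bool:
        # Assumption: FP8 E4M3 ("fp8e4nv" in Triton) needs SM 8.9+ (Ada/Hopper).
        if not torch.cuda.is_available():
            return False
        major, minor = torch.cuda.get_device_capability()
        return (major, minor) >= (8, 9)

    # Hypothetical usage on a test case:
    @unittest.skipUnless(supports_fp8_e4m3(), "fp8e4nv unsupported on this GPU")
    class FP8ActivationTests(unittest.TestCase):
        ...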
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.5316852Z 2025-05-07T20:32:51.5317082Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:51.5317468Z 2025-05-07T20:32:51.5317638Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.5318373Z self=, 2025-05-07T20:32:51.5319094Z T=1, 2025-05-07T20:32:51.5319388Z D=5120, 2025-05-07T20:32:51.5319710Z scale_ub=1200.0, 2025-05-07T20:32:51.5320068Z contiguous=True, 2025-05-07T20:32:51.5320406Z compiled=False, 2025-05-07T20:32:51.5320763Z ) 2025-05-07T20:32:51.5321315Z self = 2025-05-07T20:32:51.5322110Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:51.5322541Z 2025-05-07T20:32:51.5322658Z @given( 2025-05-07T20:32:51.5323019Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.5323542Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.5324072Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.5324631Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.5325165Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.5325978Z ) 2025-05-07T20:32:51.5326558Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.5327312Z def test_silu_mul_quant( 2025-05-07T20:32:51.5327699Z self, 2025-05-07T20:32:51.5328011Z T: int, 2025-05-07T20:32:51.5328320Z D: int, 2025-05-07T20:32:51.5328665Z scale_ub: Optional[float], 2025-05-07T20:32:51.5329126Z contiguous: bool, 2025-05-07T20:32:51.5329536Z compiled: bool, 2025-05-07T20:32:51.5329898Z ) -> None: 2025-05-07T20:32:51.5330263Z torch.manual_seed(2025) 2025-05-07T20:32:51.5330650Z 2025-05-07T20:32:51.5331076Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.5331676Z 2025-05-07T20:32:51.5332013Z x_sign = torch.sign(x) 2025-05-07T20:32:51.5332524Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.5333062Z x = x_sign * x_clamp 2025-05-07T20:32:51.5333465Z x0 = x[:, :D] 2025-05-07T20:32:51.5333879Z x1 = x[:, D:] 2025-05-07T20:32:51.5334224Z 2025-05-07T20:32:51.5334647Z if contiguous: 2025-05-07T20:32:51.5348099Z x0 = x0.contiguous() 2025-05-07T20:32:51.5348601Z x1 = x1.contiguous() 2025-05-07T20:32:51.5349035Z 2025-05-07T20:32:51.5349369Z if scale_ub is not None: 2025-05-07T20:32:51.5349844Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.5350433Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.5350976Z ) 2025-05-07T20:32:51.5351294Z else: 2025-05-07T20:32:51.5351651Z scale_ub_tensor = None 2025-05-07T20:32:51.5352098Z 2025-05-07T20:32:51.5352485Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.5353039Z op = silu_mul_quant 2025-05-07T20:32:51.5353691Z if compiled: 2025-05-07T20:32:51.5354121Z op = torch.compile(op) 2025-05-07T20:32:51.5354637Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.5355117Z 2025-05-07T20:32:51.5355434Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.5355899Z 2025-05-07T20:32:51.5356072Z moe/activation_test.py:117: 2025-05-07T20:32:51.5356588Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.5357175Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.5357655Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.5358920Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.5360198Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.5361167Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.5362438Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.5363658Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.5364478Z kernel = self.compile( 2025-05-07T20:32:51.5365310Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.5366344Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.5366953Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.5367299Z 2025-05-07T20:32:51.5367592Z self = 2025-05-07T20:32:51.5369186Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.5371247Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f121b9384a0>} 2025-05-07T20:32:51.5373254Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.5374906Z context = 2025-05-07T20:32:51.5375333Z 2025-05-07T20:32:51.5375571Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.5376329Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.5377017Z module_map=module_map) 2025-05-07T20:32:51.5377531Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.5378031Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.5378392Z E ^ 2025-05-07T20:32:51.5379072Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.5379747Z 2025-05-07T20:32:51.5380367Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.5381142Z 2025-05-07T20:32:51.5381290Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.5381885Z self=, 2025-05-07T20:32:51.5382468Z T=2048, 2025-05-07T20:32:51.5382726Z D=5120, 2025-05-07T20:32:51.5383012Z scale_ub=None, 2025-05-07T20:32:51.5383323Z contiguous=True, 2025-05-07T20:32:51.5383659Z compiled=False, 2025-05-07T20:32:51.5383976Z ) 2025-05-07T20:32:51.5384428Z self = 2025-05-07T20:32:51.5385271Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:51.5385682Z 2025-05-07T20:32:51.5385794Z @given( 2025-05-07T20:32:51.5386114Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.5386563Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.5387082Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.5387572Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.5388059Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.5388464Z ) 2025-05-07T20:32:51.5388991Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.5389688Z def test_silu_mul_quant( 2025-05-07T20:32:51.5390047Z self, 2025-05-07T20:32:51.5390365Z T: int, 2025-05-07T20:32:51.5390685Z D: int, 2025-05-07T20:32:51.5391036Z scale_ub: Optional[float], 2025-05-07T20:32:51.5391512Z contiguous: bool, 2025-05-07T20:32:51.5391920Z compiled: bool, 2025-05-07T20:32:51.5392291Z ) -> None: 2025-05-07T20:32:51.5392636Z torch.manual_seed(2025) 2025-05-07T20:32:51.5393036Z 2025-05-07T20:32:51.5393475Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.5394088Z 2025-05-07T20:32:51.5394406Z > x_sign = torch.sign(x) 2025-05-07T20:32:51.5397796Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.5401035Z 2025-05-07T20:32:51.5401260Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:51.5401624Z 2025-05-07T20:32:51.5401791Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.5402476Z self=, 2025-05-07T20:32:51.5403157Z T=16384, 2025-05-07T20:32:51.5403487Z D=5120, 2025-05-07T20:32:51.5403781Z scale_ub=None, 2025-05-07T20:32:51.5404139Z contiguous=True, 2025-05-07T20:32:51.5404495Z compiled=False, 2025-05-07T20:32:51.5404810Z ) 2025-05-07T20:32:51.6157216Z self = 2025-05-07T20:32:51.6158094Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:51.6158571Z 2025-05-07T20:32:51.6158705Z @given( 2025-05-07T20:32:51.6159088Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.6159605Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.6160121Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.6160699Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.6161232Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.6161714Z ) 2025-05-07T20:32:51.6162306Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.6163067Z def test_silu_mul_quant( 2025-05-07T20:32:51.6163474Z self, 2025-05-07T20:32:51.6163847Z T: int, 2025-05-07T20:32:51.6164161Z D: int, 2025-05-07T20:32:51.6164518Z scale_ub: Optional[float], 2025-05-07T20:32:51.6164960Z contiguous: bool, 2025-05-07T20:32:51.6165347Z compiled: bool, 2025-05-07T20:32:51.6165703Z ) -> None: 2025-05-07T20:32:51.6166026Z torch.manual_seed(2025) 2025-05-07T20:32:51.6166395Z 2025-05-07T20:32:51.6166792Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.6170651Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
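From here on the examples fail before the kernel is reached: each Hypothesis example allocates fresh CUDA tensors while earlier allocations are still held, so the process sits at ~22 GiB with only tens of MiB free. The error text itself suggests expandable_segments; a sketch of that plus an explicit cache flush between examples — where the cleanup hook lives (tearDown, or after each example) is an assumption, not taken from this test file:

    import os
    # Assumption: set before CUDA is initialized, i.e. at process start.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    import gc
    import torch

    def release_cuda_memory() -> None:
        # Drop dangling references, then return cached blocks to the allocator.
        gc.collect()
        torch.cuda.empty_cache()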
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.6174509Z 2025-05-07T20:32:51.6174724Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:51.6175101Z 2025-05-07T20:32:51.6175283Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.6175999Z self=, 2025-05-07T20:32:51.6176723Z T=4096, 2025-05-07T20:32:51.6177035Z D=5120, 2025-05-07T20:32:51.6177352Z scale_ub=None, 2025-05-07T20:32:51.6177704Z contiguous=True, 2025-05-07T20:32:51.6178055Z compiled=False, 2025-05-07T20:32:51.6178373Z ) 2025-05-07T20:32:51.6178926Z self = 2025-05-07T20:32:51.6179773Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:51.6180220Z 2025-05-07T20:32:51.6180371Z @given( 2025-05-07T20:32:51.6180733Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.6181235Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.6181731Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.6182255Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.6182797Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.6183280Z ) 2025-05-07T20:32:51.6183861Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.6184600Z def test_silu_mul_quant( 2025-05-07T20:32:51.6185002Z self, 2025-05-07T20:32:51.6185330Z T: int, 2025-05-07T20:32:51.6185640Z D: int, 2025-05-07T20:32:51.6186000Z scale_ub: Optional[float], 2025-05-07T20:32:51.6186472Z contiguous: bool, 2025-05-07T20:32:51.6186877Z compiled: bool, 2025-05-07T20:32:51.6187269Z ) -> None: 2025-05-07T20:32:51.6187636Z torch.manual_seed(2025) 2025-05-07T20:32:51.6188037Z 2025-05-07T20:32:51.6188477Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.6192299Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.6195879Z 2025-05-07T20:32:51.6196079Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:51.6196459Z 2025-05-07T20:32:51.6196642Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.6197370Z self=, 2025-05-07T20:32:51.6198095Z T=2048, 2025-05-07T20:32:51.6198414Z D=5120, 2025-05-07T20:32:51.6198773Z scale_ub=None, 2025-05-07T20:32:51.6199128Z contiguous=False, 2025-05-07T20:32:51.6199509Z compiled=False, 2025-05-07T20:32:51.6199860Z ) 2025-05-07T20:32:51.6200391Z self = 2025-05-07T20:32:51.6201218Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:51.6201684Z 2025-05-07T20:32:51.6201817Z @given( 2025-05-07T20:32:51.6202188Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.6202702Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.6203369Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.6204011Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.6204592Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.6205095Z ) 2025-05-07T20:32:51.6205861Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.6206652Z def test_silu_mul_quant( 2025-05-07T20:32:51.6207077Z self, 2025-05-07T20:32:51.6207401Z T: int, 2025-05-07T20:32:51.6207720Z D: int, 2025-05-07T20:32:51.6208093Z scale_ub: Optional[float], 2025-05-07T20:32:51.6208578Z contiguous: bool, 2025-05-07T20:32:51.6208974Z compiled: bool, 2025-05-07T20:32:51.6209355Z ) -> None: 2025-05-07T20:32:51.6209724Z torch.manual_seed(2025) 2025-05-07T20:32:51.6210149Z 2025-05-07T20:32:51.6210609Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.6214635Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.6218180Z 2025-05-07T20:32:51.6218380Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:51.6218755Z 2025-05-07T20:32:51.6218943Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.6219663Z self=, 2025-05-07T20:32:51.6220380Z T=4096, 2025-05-07T20:32:51.6220714Z D=7168, 2025-05-07T20:32:51.6221023Z scale_ub=None, 2025-05-07T20:32:51.6221382Z contiguous=True, 2025-05-07T20:32:51.6221763Z compiled=True, 2025-05-07T20:32:51.6222093Z ) 2025-05-07T20:32:51.6222565Z self = 2025-05-07T20:32:51.6223281Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:51.6223701Z 2025-05-07T20:32:51.6223816Z @given( 2025-05-07T20:32:51.6224152Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.6224624Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.6225076Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.6225910Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.6226382Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.6226796Z ) 2025-05-07T20:32:51.6227295Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.6227940Z def test_silu_mul_quant( 2025-05-07T20:32:51.6228291Z self, 2025-05-07T20:32:51.6228575Z T: int, 2025-05-07T20:32:51.6228858Z D: int, 2025-05-07T20:32:51.6229166Z scale_ub: Optional[float], 2025-05-07T20:32:51.6229542Z contiguous: bool, 2025-05-07T20:32:51.6229883Z compiled: bool, 2025-05-07T20:32:51.6230214Z ) -> None: 2025-05-07T20:32:51.6230524Z torch.manual_seed(2025) 2025-05-07T20:32:51.6230867Z 2025-05-07T20:32:51.6231249Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.6234575Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
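For reference, the strategies in the @given block above draw from fixed grids, so the example space is small: 5 values of T, 2 of D, 2 of scale_ub, 2 of contiguous, 2 of compiled, giving 80 distinct combinations from which Hypothesis samples up to _MAX_SAMPLES. A sketch of the equivalent exhaustive grid, independent of Hypothesis:

    import itertools

    # Same parameter grid the test samples from: 5 * 2 * 2 * 2 * 2 = 80 cases.
    grid = list(itertools.product(
        [1, 128, 2048, 4096, 16384],   # T
        [5120, 7168],                  # D
        [None, 1200.00],               # scale_ub
        [True, False],                 # contiguous
        [True, False],                 # compiled
    ))
    assert len(grid) == 80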
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.6237401Z 2025-05-07T20:32:51.6237575Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:51.6237879Z 2025-05-07T20:32:51.6238029Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.6238773Z self=, 2025-05-07T20:32:51.6239356Z T=2048, 2025-05-07T20:32:51.6239609Z D=5120, 2025-05-07T20:32:51.6239873Z scale_ub=1200.0, 2025-05-07T20:32:51.6240182Z contiguous=False, 2025-05-07T20:32:51.6240485Z compiled=False, 2025-05-07T20:32:51.6240774Z ) 2025-05-07T20:32:51.6241226Z self = 2025-05-07T20:32:51.6241935Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:51.6242347Z 2025-05-07T20:32:51.6242457Z @given( 2025-05-07T20:32:51.6242774Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.6243232Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.6243660Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.6244132Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.6244599Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.6245014Z ) 2025-05-07T20:32:51.6245510Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.6246151Z def test_silu_mul_quant( 2025-05-07T20:32:51.6246479Z self, 2025-05-07T20:32:51.6246748Z T: int, 2025-05-07T20:32:51.6247023Z D: int, 2025-05-07T20:32:51.6247322Z scale_ub: Optional[float], 2025-05-07T20:32:51.6247701Z contiguous: bool, 2025-05-07T20:32:51.6248056Z compiled: bool, 2025-05-07T20:32:51.6248371Z ) -> None: 2025-05-07T20:32:51.6248668Z torch.manual_seed(2025) 2025-05-07T20:32:51.6249003Z 2025-05-07T20:32:51.6249410Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.6253025Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.6256444Z 2025-05-07T20:32:51.6256640Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:51.6256997Z 2025-05-07T20:32:51.6257161Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.6257843Z self=, 2025-05-07T20:32:51.6258504Z T=4096, 2025-05-07T20:32:51.6258795Z D=7168, 2025-05-07T20:32:51.6259117Z scale_ub=1200.0, 2025-05-07T20:32:51.6259474Z contiguous=True, 2025-05-07T20:32:51.6259822Z compiled=False, 2025-05-07T20:32:51.6260150Z ) 2025-05-07T20:32:51.7324137Z self = 2025-05-07T20:32:51.7325091Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:51.7325846Z 2025-05-07T20:32:51.7325989Z @given( 2025-05-07T20:32:51.7326355Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.7326867Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.7327364Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.7327890Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.7328423Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.7328858Z ) 2025-05-07T20:32:51.7329427Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.7330617Z def test_silu_mul_quant( 2025-05-07T20:32:51.7331014Z self, 2025-05-07T20:32:51.7331326Z T: int, 2025-05-07T20:32:51.7331619Z D: int, 2025-05-07T20:32:51.7331959Z scale_ub: Optional[float], 2025-05-07T20:32:51.7332407Z contiguous: bool, 2025-05-07T20:32:51.7333036Z compiled: bool, 2025-05-07T20:32:51.7333410Z ) -> None: 2025-05-07T20:32:51.7333750Z torch.manual_seed(2025) 2025-05-07T20:32:51.7334131Z 2025-05-07T20:32:51.7334725Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.7338325Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.7341591Z 2025-05-07T20:32:51.7341781Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:51.7342152Z 2025-05-07T20:32:51.7342320Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.7342988Z self=, 2025-05-07T20:32:51.7343701Z T=16384, 2025-05-07T20:32:51.7343982Z D=7168, 2025-05-07T20:32:51.7344218Z scale_ub=None, 2025-05-07T20:32:51.7344496Z contiguous=False, 2025-05-07T20:32:51.7344789Z compiled=True, 2025-05-07T20:32:51.7345066Z ) 2025-05-07T20:32:51.7345535Z self = 2025-05-07T20:32:51.7346295Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:51.7346745Z 2025-05-07T20:32:51.7346883Z @given( 2025-05-07T20:32:51.7347225Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.7347703Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.7348202Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.7348729Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.7349287Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.7349766Z ) 2025-05-07T20:32:51.7350341Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.7351110Z def test_silu_mul_quant( 2025-05-07T20:32:51.7351515Z self, 2025-05-07T20:32:51.7351821Z T: int, 2025-05-07T20:32:51.7352152Z D: int, 2025-05-07T20:32:51.7352510Z scale_ub: Optional[float], 2025-05-07T20:32:51.7352955Z contiguous: bool, 2025-05-07T20:32:51.7353363Z compiled: bool, 2025-05-07T20:32:51.7353733Z ) -> None: 2025-05-07T20:32:51.7354079Z torch.manual_seed(2025) 2025-05-07T20:32:51.7354468Z 2025-05-07T20:32:51.7354933Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.7358561Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.7361957Z 2025-05-07T20:32:51.7362166Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:51.7362525Z 2025-05-07T20:32:51.7362694Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.7363406Z self=, 2025-05-07T20:32:51.7364256Z T=4096, 2025-05-07T20:32:51.7364583Z D=7168, 2025-05-07T20:32:51.7364889Z scale_ub=None, 2025-05-07T20:32:51.7365250Z contiguous=True, 2025-05-07T20:32:51.7365631Z compiled=False, 2025-05-07T20:32:51.7365964Z ) 2025-05-07T20:32:51.7366614Z self = 2025-05-07T20:32:51.7367474Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:51.7367946Z 2025-05-07T20:32:51.7368069Z @given( 2025-05-07T20:32:51.7368448Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.7368978Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.7369482Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.7370050Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.7370613Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.7371106Z ) 2025-05-07T20:32:51.7371708Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.7372487Z def test_silu_mul_quant( 2025-05-07T20:32:51.7372874Z self, 2025-05-07T20:32:51.7373145Z T: int, 2025-05-07T20:32:51.7373437Z D: int, 2025-05-07T20:32:51.7373766Z scale_ub: Optional[float], 2025-05-07T20:32:51.7374211Z contiguous: bool, 2025-05-07T20:32:51.7374709Z compiled: bool, 2025-05-07T20:32:51.7375074Z ) -> None: 2025-05-07T20:32:51.7375412Z torch.manual_seed(2025) 2025-05-07T20:32:51.7375816Z 2025-05-07T20:32:51.7376263Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.7379910Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.7383279Z 2025-05-07T20:32:51.7383501Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:51.7383858Z 2025-05-07T20:32:51.7384030Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.7384726Z self=, 2025-05-07T20:32:51.7385416Z T=16384, 2025-05-07T20:32:51.7385724Z D=7168, 2025-05-07T20:32:51.7386042Z scale_ub=None, 2025-05-07T20:32:51.7386387Z contiguous=True, 2025-05-07T20:32:51.7386750Z compiled=False, 2025-05-07T20:32:51.7387086Z ) 2025-05-07T20:32:51.7387624Z self = 2025-05-07T20:32:51.7388477Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:51.7388965Z 2025-05-07T20:32:51.7389105Z @given( 2025-05-07T20:32:51.7389478Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.7390006Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.7390515Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.7391081Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.7391637Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.7392110Z ) 2025-05-07T20:32:51.7392701Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.7393463Z def test_silu_mul_quant( 2025-05-07T20:32:51.7393885Z self, 2025-05-07T20:32:51.7394162Z T: int, 2025-05-07T20:32:51.7394464Z D: int, 2025-05-07T20:32:51.7394782Z scale_ub: Optional[float], 2025-05-07T20:32:51.7395173Z contiguous: bool, 2025-05-07T20:32:51.7395547Z compiled: bool, 2025-05-07T20:32:51.7395851Z ) -> None: 2025-05-07T20:32:51.7396331Z torch.manual_seed(2025) 2025-05-07T20:32:51.7396705Z 2025-05-07T20:32:51.7397150Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.7400582Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.7404153Z 2025-05-07T20:32:51.7404373Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:51.7404740Z 2025-05-07T20:32:51.7404926Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.7405640Z self=, 2025-05-07T20:32:51.7406349Z T=16384, 2025-05-07T20:32:51.7406669Z D=7168, 2025-05-07T20:32:51.7406977Z scale_ub=1200.0, 2025-05-07T20:32:51.7407352Z contiguous=True, 2025-05-07T20:32:51.7407741Z compiled=False, 2025-05-07T20:32:51.7421534Z ) 2025-05-07T20:32:51.7422098Z self = 2025-05-07T20:32:51.7422945Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:51.7423436Z 2025-05-07T20:32:51.7423566Z @given( 2025-05-07T20:32:51.7423961Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.7424486Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.7425016Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.7425921Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.7426492Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.7426975Z ) 2025-05-07T20:32:51.7427582Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.7428355Z def test_silu_mul_quant( 2025-05-07T20:32:51.7428752Z self, 2025-05-07T20:32:51.7429073Z T: int, 2025-05-07T20:32:51.7429413Z D: int, 2025-05-07T20:32:51.7429765Z scale_ub: Optional[float], 2025-05-07T20:32:51.7430233Z contiguous: bool, 2025-05-07T20:32:51.7430632Z compiled: bool, 2025-05-07T20:32:51.7430996Z ) -> None: 2025-05-07T20:32:51.7431358Z torch.manual_seed(2025) 2025-05-07T20:32:51.7431765Z 2025-05-07T20:32:51.7432213Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.7435953Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.7439348Z 2025-05-07T20:32:51.7439556Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:51.7439932Z 2025-05-07T20:32:51.7440105Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.7440814Z self=, 2025-05-07T20:32:51.7441502Z T=128, 2025-05-07T20:32:51.7441810Z D=5120, 2025-05-07T20:32:51.7442138Z scale_ub=1200.0, 2025-05-07T20:32:51.7442506Z contiguous=False, 2025-05-07T20:32:51.7442892Z compiled=False, 2025-05-07T20:32:51.7443236Z ) 2025-05-07T20:32:51.8694265Z self = 2025-05-07T20:32:51.8695392Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:51.8695701Z 2025-05-07T20:32:51.8695791Z @given( 2025-05-07T20:32:51.8696038Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.8696370Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.8696864Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.8697214Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.8697562Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.8697857Z ) 2025-05-07T20:32:51.8698215Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.8698674Z def test_silu_mul_quant( 2025-05-07T20:32:51.8698917Z self, 2025-05-07T20:32:51.8699133Z T: int, 2025-05-07T20:32:51.8699344Z D: int, 2025-05-07T20:32:51.8699576Z scale_ub: Optional[float], 2025-05-07T20:32:51.8699863Z contiguous: bool, 2025-05-07T20:32:51.8700125Z compiled: bool, 2025-05-07T20:32:51.8700370Z ) -> None: 2025-05-07T20:32:51.8700584Z torch.manual_seed(2025) 2025-05-07T20:32:51.8700837Z 2025-05-07T20:32:51.8701125Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.8701484Z 2025-05-07T20:32:51.8701692Z x_sign = torch.sign(x) 2025-05-07T20:32:51.8701998Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.8702317Z x = x_sign * x_clamp 2025-05-07T20:32:51.8702569Z x0 = x[:, :D] 2025-05-07T20:32:51.8702800Z x1 = x[:, D:] 2025-05-07T20:32:51.8703018Z 2025-05-07T20:32:51.8703209Z if contiguous: 2025-05-07T20:32:51.8703453Z x0 = x0.contiguous() 2025-05-07T20:32:51.8703710Z x1 = x1.contiguous() 2025-05-07T20:32:51.8703965Z 2025-05-07T20:32:51.8704174Z if scale_ub is not None: 2025-05-07T20:32:51.8704449Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.8704802Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.8705133Z ) 2025-05-07T20:32:51.8705337Z else: 2025-05-07T20:32:51.8705555Z scale_ub_tensor = None 2025-05-07T20:32:51.8705821Z 2025-05-07T20:32:51.8706056Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.8706383Z op = silu_mul_quant 2025-05-07T20:32:51.8706647Z if compiled: 2025-05-07T20:32:51.8706903Z op = torch.compile(op) 2025-05-07T20:32:51.8707213Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.8707504Z 2025-05-07T20:32:51.8707703Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.8707870Z 2025-05-07T20:32:51.8707975Z moe/activation_test.py:117: 2025-05-07T20:32:51.8708280Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.8708630Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.8708920Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.8709649Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.8710382Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.8710949Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.8711666Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.8712365Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.8712928Z kernel = self.compile( 2025-05-07T20:32:51.8713496Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.8714180Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.8714594Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.8714831Z 2025-05-07T20:32:51.8715140Z self = 2025-05-07T20:32:51.8716270Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.8717798Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f121b943060>} 2025-05-07T20:32:51.8719216Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.8720323Z context = 2025-05-07T20:32:51.8720631Z 2025-05-07T20:32:51.8720822Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.8721371Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.8721871Z module_map=module_map) 2025-05-07T20:32:51.8722260Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.8722636Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.8722912Z E ^ 2025-05-07T20:32:51.8723402Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.8723876Z 2025-05-07T20:32:51.8724325Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.8724871Z 2025-05-07T20:32:51.8724980Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.8725645Z self=, 2025-05-07T20:32:51.8726206Z T=2048, 2025-05-07T20:32:51.8726463Z D=7168, 2025-05-07T20:32:51.8726706Z scale_ub=None, 2025-05-07T20:32:51.8726999Z contiguous=False, 2025-05-07T20:32:51.8727288Z compiled=False, 2025-05-07T20:32:51.8727506Z ) 2025-05-07T20:32:51.8727849Z self = 2025-05-07T20:32:51.8728380Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:51.8728669Z 2025-05-07T20:32:51.8728751Z @given( 2025-05-07T20:32:51.8728993Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.8729320Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.8729636Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.8729978Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.8730322Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.8730630Z ) 2025-05-07T20:32:51.8730985Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.8731458Z def test_silu_mul_quant( 2025-05-07T20:32:51.8731716Z self, 2025-05-07T20:32:51.8731914Z T: int, 2025-05-07T20:32:51.8732115Z D: int, 2025-05-07T20:32:51.8732339Z scale_ub: Optional[float], 2025-05-07T20:32:51.8732621Z contiguous: bool, 2025-05-07T20:32:51.8732875Z compiled: bool, 2025-05-07T20:32:51.8733104Z ) -> None: 2025-05-07T20:32:51.8733317Z torch.manual_seed(2025) 2025-05-07T20:32:51.8733566Z 2025-05-07T20:32:51.8733846Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.8736371Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 5.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.8738375Z 2025-05-07T20:32:51.8738511Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:51.8738855Z 2025-05-07T20:32:51.8738962Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.8739403Z self=, 2025-05-07T20:32:51.8739839Z T=128, 2025-05-07T20:32:51.8740031Z D=7168, 2025-05-07T20:32:51.8740237Z scale_ub=1200.0, 2025-05-07T20:32:51.8740468Z contiguous=True, 2025-05-07T20:32:51.8740694Z compiled=True, 2025-05-07T20:32:51.8740912Z ) 2025-05-07T20:32:51.9055624Z self = 2025-05-07T20:32:51.9056235Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:51.9056550Z 2025-05-07T20:32:51.9056641Z @given( 2025-05-07T20:32:51.9056905Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.9057268Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.9057621Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.9058004Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.9058384Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.9058710Z ) 2025-05-07T20:32:51.9059114Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.9059637Z def test_silu_mul_quant( 2025-05-07T20:32:51.9059906Z self, 2025-05-07T20:32:51.9060113Z T: int, 2025-05-07T20:32:51.9060334Z D: int, 2025-05-07T20:32:51.9060576Z scale_ub: Optional[float], 2025-05-07T20:32:51.9060882Z contiguous: bool, 2025-05-07T20:32:51.9061142Z compiled: bool, 2025-05-07T20:32:51.9061390Z ) -> None: 2025-05-07T20:32:51.9061625Z torch.manual_seed(2025) 2025-05-07T20:32:51.9061893Z 2025-05-07T20:32:51.9062198Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.9062591Z 2025-05-07T20:32:51.9062795Z x_sign = torch.sign(x) 2025-05-07T20:32:51.9063118Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.9063478Z x = x_sign * x_clamp 2025-05-07T20:32:51.9063735Z x0 = x[:, :D] 2025-05-07T20:32:51.9063974Z x1 = x[:, D:] 2025-05-07T20:32:51.9064204Z 2025-05-07T20:32:51.9064400Z if contiguous: 2025-05-07T20:32:51.9064655Z x0 = x0.contiguous() 2025-05-07T20:32:51.9064944Z x1 = x1.contiguous() 2025-05-07T20:32:51.9065206Z 2025-05-07T20:32:51.9065418Z if scale_ub is not None: 2025-05-07T20:32:51.9065723Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.9066102Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.9066445Z ) 2025-05-07T20:32:51.9066659Z else: 2025-05-07T20:32:51.9066899Z scale_ub_tensor = None 2025-05-07T20:32:51.9067182Z 2025-05-07T20:32:51.9067442Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.9067807Z op = silu_mul_quant 2025-05-07T20:32:51.9068082Z if compiled: 2025-05-07T20:32:51.9068353Z op = torch.compile(op) 2025-05-07T20:32:51.9068685Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.9068987Z 2025-05-07T20:32:51.9069201Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.9069383Z 2025-05-07T20:32:51.9069503Z moe/activation_test.py:117: 2025-05-07T20:32:51.9069836Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.9070210Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.9070524Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.9071185Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:51.9072138Z return fn(*args, **kwargs) 
2025-05-07T20:32:51.9072837Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.9073572Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.9074306Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.9075022Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.9075721Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.9076285Z kernel = self.compile( 2025-05-07T20:32:51.9076846Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.9077537Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.9077967Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.9078207Z 2025-05-07T20:32:51.9078427Z self = 2025-05-07T20:32:51.9079555Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.9081011Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f121b7cc900>} 2025-05-07T20:32:51.9082431Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.9083521Z context = 2025-05-07T20:32:51.9083822Z 2025-05-07T20:32:51.9084003Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.9084542Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.9085035Z module_map=module_map) 2025-05-07T20:32:51.9085416Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.9085775Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.9086047Z E ^ 2025-05-07T20:32:51.9086529Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.9087002Z 2025-05-07T20:32:51.9087447Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.9087996Z 2025-05-07T20:32:51.9088103Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.9088530Z self=, 2025-05-07T20:32:51.9088959Z T=128, 2025-05-07T20:32:51.9089148Z D=7168, 2025-05-07T20:32:51.9089346Z scale_ub=1200.0, 2025-05-07T20:32:51.9089575Z contiguous=True, 2025-05-07T20:32:51.9089795Z compiled=False, 2025-05-07T20:32:51.9090009Z ) 2025-05-07T20:32:51.9090336Z self = 2025-05-07T20:32:51.9090849Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:51.9091132Z 2025-05-07T20:32:51.9091214Z @given( 2025-05-07T20:32:51.9091475Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.9091798Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.9092117Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.9092452Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.9092797Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.9093095Z ) 2025-05-07T20:32:51.9093539Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.9094001Z def test_silu_mul_quant( 2025-05-07T20:32:51.9094249Z self, 2025-05-07T20:32:51.9094590Z T: int, 2025-05-07T20:32:51.9094793Z D: int, 2025-05-07T20:32:51.9095103Z scale_ub: Optional[float], 2025-05-07T20:32:51.9095381Z contiguous: bool, 2025-05-07T20:32:51.9095618Z compiled: bool, 2025-05-07T20:32:51.9095844Z ) -> None: 2025-05-07T20:32:51.9096061Z torch.manual_seed(2025) 2025-05-07T20:32:51.9096304Z 2025-05-07T20:32:51.9096580Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.9096942Z 2025-05-07T20:32:51.9097131Z x_sign = torch.sign(x) 2025-05-07T20:32:51.9097427Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.9099574Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 4.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
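Where the Triton path cannot compile, an eager reference makes the intended math explicit: silu(x0) * x1, then per-row FP8 quantization. A rough sketch under stated assumptions — row-wise scaling against the E4M3 finite maximum of 448, torch.float8_e4m3fn available (PyTorch 2.1+), and the scale_ub clamp omitted for brevity. This is the shape of the computation, not the FBGEMM implementation:

    import torch

    FP8_E4M3_MAX = 448.0  # largest finite value representable in E4M3

    def silu_mul_quant_ref(x0: torch.Tensor, x1: torch.Tensor):
        # silu(x0) * x1 in fp32, then quantize each row to FP8 E4M3.
        y = (x0.float() * torch.sigmoid(x0.float())) * x1.float()
        row_max = y.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
        scale = row_max / FP8_E4M3_MAX
        y_fp8 = (y / scale).to(torch.float8_e4m3fn)
        return y_fp8, scale.squeeze(1)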
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.9101572Z 2025-05-07T20:32:51.9101696Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:51.9101915Z 2025-05-07T20:32:51.9102029Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.9102452Z self=, 2025-05-07T20:32:51.9102880Z T=128, 2025-05-07T20:32:51.9103068Z D=5120, 2025-05-07T20:32:51.9103253Z scale_ub=1200.0, 2025-05-07T20:32:51.9103471Z contiguous=True, 2025-05-07T20:32:51.9103692Z compiled=True, 2025-05-07T20:32:51.9103891Z ) 2025-05-07T20:32:51.9104217Z self = 2025-05-07T20:32:51.9104727Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:51.9105006Z 2025-05-07T20:32:51.9105084Z @given( 2025-05-07T20:32:51.9105316Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.9105630Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.9105939Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.9106266Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.9106602Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.9106895Z ) 2025-05-07T20:32:51.9107243Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.9107701Z def test_silu_mul_quant( 2025-05-07T20:32:51.9107948Z self, 2025-05-07T20:32:51.9108143Z T: int, 2025-05-07T20:32:51.9108347Z D: int, 2025-05-07T20:32:51.9108569Z scale_ub: Optional[float], 2025-05-07T20:32:51.9108837Z contiguous: bool, 2025-05-07T20:32:51.9109079Z compiled: bool, 2025-05-07T20:32:51.9109299Z ) -> None: 2025-05-07T20:32:51.9109506Z torch.manual_seed(2025) 2025-05-07T20:32:51.9109755Z 2025-05-07T20:32:51.9110030Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.9110377Z 2025-05-07T20:32:51.9110573Z > x_sign = torch.sign(x) 2025-05-07T20:32:51.9112742Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:51.9114742Z 2025-05-07T20:32:51.9114865Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:51.9115085Z 2025-05-07T20:32:51.9115198Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.9115702Z self=, 2025-05-07T20:32:51.9116132Z T=128, 2025-05-07T20:32:51.9116330Z D=7168, 2025-05-07T20:32:51.9116528Z scale_ub=None, 2025-05-07T20:32:51.9116752Z contiguous=True, 2025-05-07T20:32:51.9116983Z compiled=True, 2025-05-07T20:32:51.9117190Z ) 2025-05-07T20:32:52.2511361Z self = 2025-05-07T20:32:52.2511926Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:52.2512223Z 2025-05-07T20:32:52.2512309Z @given( 2025-05-07T20:32:52.2512558Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.2512909Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.2513232Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.2513582Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.2513960Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.2514286Z ) 2025-05-07T20:32:52.2514664Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.2515141Z def test_silu_mul_quant( 2025-05-07T20:32:52.2515393Z self, 2025-05-07T20:32:52.2515617Z T: int, 2025-05-07T20:32:52.2515832Z D: int, 2025-05-07T20:32:52.2516059Z scale_ub: Optional[float], 2025-05-07T20:32:52.2516354Z contiguous: bool, 2025-05-07T20:32:52.2516615Z compiled: bool, 2025-05-07T20:32:52.2516846Z ) -> None: 2025-05-07T20:32:52.2517072Z torch.manual_seed(2025) 2025-05-07T20:32:52.2517329Z 2025-05-07T20:32:52.2517610Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.2519825Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:52.2521848Z 2025-05-07T20:32:52.2521973Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:52.2522205Z 2025-05-07T20:32:52.2608603Z FAILED 2025-05-07T20:32:52.2608750Z 2025-05-07T20:32:52.2608888Z =================================== FAILURES =================================== 2025-05-07T20:32:52.2609341Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:32:52.2609806Z + Exception Group Traceback (most recent call last): 2025-05-07T20:32:52.2610456Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 58, in testPartExecutor 2025-05-07T20:32:52.2611028Z | yield 2025-05-07T20:32:52.2611492Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 634, in run 2025-05-07T20:32:52.2612177Z | self._callTestMethod(testMethod) 2025-05-07T20:32:52.2612830Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 589, in _callTestMethod 2025-05-07T20:32:52.2613421Z | if method() is not None: 2025-05-07T20:32:52.2613675Z | ^^^^^^^^ 2025-05-07T20:32:52.2614638Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:32:52.2615522Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.2616140Z | ^^^^^^^ 2025-05-07T20:32:52.2616768Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:32:52.2617436Z | raise the_error_hypothesis_found 2025-05-07T20:32:52.2618021Z | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:32:52.2618464Z +-+---------------- 1 ---------------- 2025-05-07T20:32:52.2618777Z | Traceback (most recent call last): 2025-05-07T20:32:52.2619538Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:52.2620433Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.2620835Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:52.2623235Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:52.2639944Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:52.2640614Z | self=, 2025-05-07T20:32:52.2641073Z | T=128, 2025-05-07T20:32:52.2641351Z | D=7168, 2025-05-07T20:32:52.2641604Z | scale_ub=1200.0, 2025-05-07T20:32:52.2641932Z | contiguous=True, 2025-05-07T20:32:52.2642179Z | compiled=False, 2025-05-07T20:32:52.2642420Z | ) 2025-05-07T20:32:52.2642622Z | 2025-05-07T20:32:52.2643187Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAUEAQQE=') as a decorator on your test case 2025-05-07T20:32:52.2643839Z +---------------- 2 ---------------- 2025-05-07T20:32:52.2644158Z | Traceback (most recent call last): 2025-05-07T20:32:52.2644965Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:52.2645840Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.2646239Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:52.2648360Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:52.2650476Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:52.2650941Z | self=, 2025-05-07T20:32:52.2651372Z | T=128, 2025-05-07T20:32:52.2651585Z | D=7168, 2025-05-07T20:32:52.2651804Z | scale_ub=None, 2025-05-07T20:32:52.2652046Z | contiguous=True, 2025-05-07T20:32:52.2652298Z | compiled=True, 2025-05-07T20:32:52.2652534Z | ) 2025-05-07T20:32:52.2652717Z | 2025-05-07T20:32:52.2653265Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:52.2653916Z +---------------- 3 ---------------- 2025-05-07T20:32:52.2654517Z | Traceback (most recent call last): 2025-05-07T20:32:52.2655271Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:52.2656211Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.2656607Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:52.2658714Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
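Each falsifying example above carries a @reproduce_failure hint. As the Hypothesis message says, temporarily pasting that decorator onto the test replays exactly that example. A hedged sketch, with the version string and payload copied verbatim from the first failure above; the trimmed-down @given signature is illustrative only, since the real test also takes D, scale_ub, contiguous, and compiled:

    # Temporary decorator to replay the first falsifying example locally.
    # Remove it again once the underlying bug is fixed.
    from hypothesis import given, reproduce_failure, strategies as st

    @reproduce_failure('6.131.14', b'AEEBQQFBAUEAQQE=')
    @given(T=st.sampled_from([1, 128, 2048, 4096, 16384]))
    def test_silu_mul_quant(T: int) -> None:
        ...  # original test body
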
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:52.2660809Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:52.2661282Z | self=, 2025-05-07T20:32:52.2661715Z | T=128, 2025-05-07T20:32:52.2661936Z | D=5120, 2025-05-07T20:32:52.2662153Z | scale_ub=1200.0, 2025-05-07T20:32:52.2662412Z | contiguous=True, 2025-05-07T20:32:52.2662672Z | compiled=True, 2025-05-07T20:32:52.2662908Z | ) 2025-05-07T20:32:52.2663123Z | 2025-05-07T20:32:52.2663701Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:32:52.2664603Z +---------------- 4 ---------------- 2025-05-07T20:32:52.2665028Z | Traceback (most recent call last): 2025-05-07T20:32:52.2666084Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:32:52.2667131Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:52.2667554Z | ^^^^^^^^ 2025-05-07T20:32:52.2668480Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:32:52.2669511Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:52.2669986Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:52.2671158Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:32:52.2672327Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:52.2673213Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:32:52.2674288Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.2674950Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:52.2675905Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:32:52.2677034Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:52.2677712Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:52.2678637Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:32:52.2679651Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:52.2680179Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:52.2681134Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:32:52.2681972Z | fn() 2025-05-07T20:32:52.2682802Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:32:52.2683699Z | self.fn.run( 2025-05-07T20:32:52.2684264Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:32:52.2684879Z | kernel = self.compile( 2025-05-07T20:32:52.2685157Z | ^^^^^^^^^^^^^ 2025-05-07T20:32:52.2685775Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:32:52.2686524Z | 
module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.2686933Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:52.2687620Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:52.2688467Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.2688976Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:32:52.2689377Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.2689745Z | def _kernel_quantize_fp8_row( 2025-05-07T20:32:52.2690022Z | ^ 2025-05-07T20:32:52.2690507Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.2691101Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:52.2691513Z | # The test always failed when commented parts were varied together. 2025-05-07T20:32:52.2692060Z | self=, 2025-05-07T20:32:52.2692515Z | T=1, # or any other generated value 2025-05-07T20:32:52.2692832Z | D=5120, # or any other generated value 2025-05-07T20:32:52.2693186Z | scale_ub=None, # or any other generated value 2025-05-07T20:32:52.2693565Z | contiguous=True, # or any other generated value 2025-05-07T20:32:52.2693975Z | compiled=True, # or any other generated value 2025-05-07T20:32:52.2694300Z | ) 2025-05-07T20:32:52.2694604Z | 2025-05-07T20:32:52.2695158Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:52.2695789Z +------------------------------------ 2025-05-07T20:32:52.2696160Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:32:52.2696556Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.2696983Z self=, 2025-05-07T20:32:52.2697418Z T=1, 2025-05-07T20:32:52.2697618Z D=5120, 2025-05-07T20:32:52.2697815Z scale_ub=None, 2025-05-07T20:32:52.2698043Z contiguous=True, 2025-05-07T20:32:52.2698282Z compiled=True, 2025-05-07T20:32:52.2698503Z ) 2025-05-07T20:32:52.2698840Z self = 2025-05-07T20:32:52.2699348Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:52.2699623Z 2025-05-07T20:32:52.2699715Z @given( 2025-05-07T20:32:52.2699952Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.2700287Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.2700611Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.2700952Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.2701374Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.2701784Z ) 2025-05-07T20:32:52.2702393Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.2703047Z def test_silu_mul_quant( 2025-05-07T20:32:52.2703406Z self, 2025-05-07T20:32:52.2703689Z T: int, 2025-05-07T20:32:52.2703949Z D: int, 2025-05-07T20:32:52.2704343Z scale_ub: Optional[float], 2025-05-07T20:32:52.2704724Z contiguous: bool, 2025-05-07T20:32:52.2705055Z compiled: bool, 2025-05-07T20:32:52.2705382Z ) -> None: 2025-05-07T20:32:52.2705703Z torch.manual_seed(2025) 2025-05-07T20:32:52.2706051Z 2025-05-07T20:32:52.2706447Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.2706950Z 2025-05-07T20:32:52.2707222Z x_sign = torch.sign(x) 2025-05-07T20:32:52.2707646Z x_clamp = 
torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.2708092Z x = x_sign * x_clamp 2025-05-07T20:32:52.2708436Z x0 = x[:, :D] 2025-05-07T20:32:52.2708768Z x1 = x[:, D:] 2025-05-07T20:32:52.2709077Z 2025-05-07T20:32:52.2709348Z if contiguous: 2025-05-07T20:32:52.2709692Z x0 = x0.contiguous() 2025-05-07T20:32:52.2710068Z x1 = x1.contiguous() 2025-05-07T20:32:52.2710424Z 2025-05-07T20:32:52.2710693Z if scale_ub is not None: 2025-05-07T20:32:52.2711103Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.2711586Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.2712028Z ) 2025-05-07T20:32:52.2712312Z else: 2025-05-07T20:32:52.2712619Z scale_ub_tensor = None 2025-05-07T20:32:52.2712990Z 2025-05-07T20:32:52.2713318Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.2713797Z op = silu_mul_quant 2025-05-07T20:32:52.2714163Z if compiled: 2025-05-07T20:32:52.2714532Z op = torch.compile(op) 2025-05-07T20:32:52.2714958Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.2715361Z 2025-05-07T20:32:52.2715649Z y_fp8, y_scale = fn() 2025-05-07T20:32:52.2716062Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:52.2716502Z 2025-05-07T20:32:52.2716839Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.2717336Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:52.2717778Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:52.2718229Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:52.2718765Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:52.2719250Z 2025-05-07T20:32:52.2719541Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:52.2719840Z 2025-05-07T20:32:52.2719986Z moe/activation_test.py:126: 2025-05-07T20:32:52.2720445Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.2720947Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:52.2721415Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:52.2722580Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:52.2723688Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:52.2724533Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.2725850Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.2726888Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:52.2727975Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:52.2729017Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:52.2729952Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:52.2731063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:52.2731841Z fn() 2025-05-07T20:32:52.2732583Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:52.2733589Z self.fn.run( 2025-05-07T20:32:52.2734270Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.2735186Z kernel = self.compile( 2025-05-07T20:32:52.2735972Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.2736926Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.2737496Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.2737820Z 2025-05-07T20:32:52.2738103Z self = 2025-05-07T20:32:52.2739650Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.2741687Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1295394540>} 2025-05-07T20:32:52.2743675Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.2745178Z context = 2025-05-07T20:32:52.2745596Z 2025-05-07T20:32:52.2745826Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.2746546Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.2747185Z module_map=module_map) 2025-05-07T20:32:52.2747659Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.2748137Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:52.2748517Z E ^ 2025-05-07T20:32:52.2749137Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.2749767Z 2025-05-07T20:32:52.2750342Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.2751061Z 2025-05-07T20:32:52.2751205Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.2751777Z self=, 2025-05-07T20:32:52.2752325Z T=2048, 2025-05-07T20:32:52.2752586Z D=5120, 2025-05-07T20:32:52.2752849Z scale_ub=1200.0, 2025-05-07T20:32:52.2753157Z contiguous=True, 2025-05-07T20:32:52.2753481Z compiled=False, 2025-05-07T20:32:52.2753785Z ) 2025-05-07T20:32:52.2754266Z self = 2025-05-07T20:32:52.2754948Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:52.2755336Z 2025-05-07T20:32:52.2755440Z @given( 2025-05-07T20:32:52.2755734Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.2756143Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.2756593Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.2757060Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.2757475Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.2757867Z ) 2025-05-07T20:32:52.2758335Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.2758910Z def test_silu_mul_quant( 2025-05-07T20:32:52.2759231Z self, 2025-05-07T20:32:52.2759582Z T: int, 2025-05-07T20:32:52.2759825Z D: int, 2025-05-07T20:32:52.2760122Z scale_ub: Optional[float], 2025-05-07T20:32:52.2760493Z contiguous: bool, 2025-05-07T20:32:52.2760814Z compiled: bool, 2025-05-07T20:32:52.2761204Z ) -> None: 2025-05-07T20:32:52.2761494Z torch.manual_seed(2025) 2025-05-07T20:32:52.2761813Z 2025-05-07T20:32:52.2762141Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.2762605Z 2025-05-07T20:32:52.2762871Z x_sign = torch.sign(x) 2025-05-07T20:32:52.2763257Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.2763690Z x = x_sign * x_clamp 2025-05-07T20:32:52.2764027Z x0 = x[:, :D] 
2025-05-07T20:32:52.2764325Z x1 = x[:, D:] 2025-05-07T20:32:52.2764614Z 2025-05-07T20:32:52.2764867Z if contiguous: 2025-05-07T20:32:52.2765177Z x0 = x0.contiguous() 2025-05-07T20:32:52.2765547Z x1 = x1.contiguous() 2025-05-07T20:32:52.2765874Z 2025-05-07T20:32:52.2766124Z if scale_ub is not None: 2025-05-07T20:32:52.2766492Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.2766932Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.2767348Z ) 2025-05-07T20:32:52.2767596Z else: 2025-05-07T20:32:52.2767870Z scale_ub_tensor = None 2025-05-07T20:32:52.2768203Z 2025-05-07T20:32:52.2768496Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.2768918Z op = silu_mul_quant 2025-05-07T20:32:52.2769246Z if compiled: 2025-05-07T20:32:52.2769558Z op = torch.compile(op) 2025-05-07T20:32:52.2769955Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.2770330Z 2025-05-07T20:32:52.2770576Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.2770796Z 2025-05-07T20:32:52.2770930Z moe/activation_test.py:117: 2025-05-07T20:32:52.2771328Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.2771771Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.2772147Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.2773065Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:52.2774005Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.2774826Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.2775751Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.2776649Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.2777370Z kernel = self.compile( 2025-05-07T20:32:52.2778095Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.2778982Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.2779512Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.2779815Z 2025-05-07T20:32:52.2780094Z self = 2025-05-07T20:32:52.2781544Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.2783414Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1295c8f240>} 2025-05-07T20:32:52.2785237Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.2786640Z context = 2025-05-07T20:32:52.2787034Z 2025-05-07T20:32:52.2787241Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.2787993Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.2788591Z module_map=module_map) 2025-05-07T20:32:52.2789059Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.2789519Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.2789876Z E ^ 2025-05-07T20:32:52.2790475Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.2791080Z 2025-05-07T20:32:52.2791609Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.2792280Z 2025-05-07T20:32:52.2792416Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.2792938Z self=, 2025-05-07T20:32:52.2793452Z T=2048, 2025-05-07T20:32:52.2793703Z D=5120, 2025-05-07T20:32:52.2793975Z scale_ub=1200.0, 2025-05-07T20:32:52.2794269Z contiguous=True, 2025-05-07T20:32:52.2794557Z compiled=True, 2025-05-07T20:32:52.2794816Z ) 2025-05-07T20:32:52.2795231Z self = 2025-05-07T20:32:52.2795881Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:52.2796253Z 2025-05-07T20:32:52.2796357Z @given( 2025-05-07T20:32:52.2796653Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.2797053Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.2797459Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.2797903Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.2798340Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.2798719Z ) 2025-05-07T20:32:52.2799183Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.2799784Z def test_silu_mul_quant( 2025-05-07T20:32:52.2800093Z self, 2025-05-07T20:32:52.2800352Z T: int, 2025-05-07T20:32:52.2800612Z D: int, 2025-05-07T20:32:52.2800892Z scale_ub: Optional[float], 2025-05-07T20:32:52.2801250Z contiguous: bool, 2025-05-07T20:32:52.2801568Z compiled: bool, 2025-05-07T20:32:52.2801855Z ) -> None: 2025-05-07T20:32:52.2802143Z torch.manual_seed(2025) 2025-05-07T20:32:52.2802462Z 2025-05-07T20:32:52.2802815Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.2803273Z 2025-05-07T20:32:52.2803530Z x_sign = torch.sign(x) 2025-05-07T20:32:52.2803902Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.2804320Z x = x_sign * x_clamp 2025-05-07T20:32:52.2804642Z x0 = x[:, :D] 2025-05-07T20:32:52.2804950Z x1 = x[:, D:] 2025-05-07T20:32:52.2805226Z 2025-05-07T20:32:52.2805489Z if contiguous: 2025-05-07T20:32:52.2805814Z x0 = x0.contiguous() 2025-05-07T20:32:52.2806157Z x1 = x1.contiguous() 2025-05-07T20:32:52.2806498Z 2025-05-07T20:32:52.2806777Z if scale_ub is not None: 2025-05-07T20:32:52.2807137Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.2807600Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.2808041Z ) 2025-05-07T20:32:52.2808310Z else: 2025-05-07T20:32:52.2808630Z scale_ub_tensor = None 2025-05-07T20:32:52.2809018Z 2025-05-07T20:32:52.2809337Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.2809761Z op = silu_mul_quant 2025-05-07T20:32:52.2810100Z if compiled: 2025-05-07T20:32:52.2810524Z op = torch.compile(op) 2025-05-07T20:32:52.2810938Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.2811313Z 2025-05-07T20:32:52.2811580Z y_fp8, y_scale = fn() 2025-05-07T20:32:52.2811951Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:52.2812478Z 2025-05-07T20:32:52.2812805Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.2813267Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:52.2813684Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:52.2814134Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:52.2814832Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:52.2815264Z 2025-05-07T20:32:52.2815550Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:52.2815829Z 2025-05-07T20:32:52.2815971Z moe/activation_test.py:126: 2025-05-07T20:32:52.2816404Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.2816860Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:52.2817300Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:52.2818399Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:52.2819452Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:52.2820210Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.2821174Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.2839756Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:52.2840774Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:52.2841807Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:52.2842697Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:52.2843541Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:52.2844300Z fn() 2025-05-07T20:32:52.2845052Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:52.2845904Z self.fn.run( 2025-05-07T20:32:52.2846566Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.2847322Z kernel = self.compile( 2025-05-07T20:32:52.2848058Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.2848952Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.2849510Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.2849822Z 2025-05-07T20:32:52.2850098Z self = 2025-05-07T20:32:52.2851621Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.2853568Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f129679a020>} 2025-05-07T20:32:52.2855674Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.2857170Z context = 2025-05-07T20:32:52.2857896Z 2025-05-07T20:32:52.2858136Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.2858893Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.2859725Z module_map=module_map) 2025-05-07T20:32:52.2860246Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.2860747Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:52.2861118Z E ^ 2025-05-07T20:32:52.2861766Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.2862415Z 2025-05-07T20:32:52.2863035Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.2863765Z 2025-05-07T20:32:52.2863914Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.2864503Z self=, 2025-05-07T20:32:52.2865087Z T=16384, 2025-05-07T20:32:52.2865366Z D=7168, 2025-05-07T20:32:52.2865651Z scale_ub=1200.0, 2025-05-07T20:32:52.2865973Z contiguous=False, 2025-05-07T20:32:52.2866293Z compiled=False, 2025-05-07T20:32:52.2866582Z ) 2025-05-07T20:32:52.2867030Z self = 2025-05-07T20:32:52.2867726Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:52.2868124Z 2025-05-07T20:32:52.2868229Z @given( 2025-05-07T20:32:52.2868543Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.2868976Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.2869388Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.2869824Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.2870251Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.2870627Z ) 2025-05-07T20:32:52.2871129Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.2871752Z def test_silu_mul_quant( 2025-05-07T20:32:52.2872083Z self, 2025-05-07T20:32:52.2872342Z T: int, 2025-05-07T20:32:52.2872626Z D: int, 2025-05-07T20:32:52.2872912Z scale_ub: Optional[float], 2025-05-07T20:32:52.2873278Z contiguous: bool, 2025-05-07T20:32:52.2873599Z compiled: bool, 2025-05-07T20:32:52.2873896Z ) -> None: 2025-05-07T20:32:52.2874187Z torch.manual_seed(2025) 2025-05-07T20:32:52.2874522Z 2025-05-07T20:32:52.2874886Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.2875348Z 2025-05-07T20:32:52.2875617Z x_sign = torch.sign(x) 2025-05-07T20:32:52.2876011Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.2876430Z x = x_sign * x_clamp 2025-05-07T20:32:52.2876761Z x0 = x[:, :D] 2025-05-07T20:32:52.2877072Z x1 = x[:, D:] 2025-05-07T20:32:52.2877367Z 2025-05-07T20:32:52.2877627Z if contiguous: 2025-05-07T20:32:52.2877956Z x0 = x0.contiguous() 2025-05-07T20:32:52.2878318Z x1 = x1.contiguous() 2025-05-07T20:32:52.2878665Z 2025-05-07T20:32:52.2878947Z if scale_ub is not None: 2025-05-07T20:32:52.2879330Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.2879786Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.2880204Z ) 2025-05-07T20:32:52.2880440Z else: 2025-05-07T20:32:52.2880665Z scale_ub_tensor = None 2025-05-07T20:32:52.2880929Z 2025-05-07T20:32:52.2881178Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.2881508Z op = silu_mul_quant 2025-05-07T20:32:52.2881776Z if compiled: 2025-05-07T20:32:52.2882039Z op = torch.compile(op) 2025-05-07T20:32:52.2882344Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.2882747Z 2025-05-07T20:32:52.2882955Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.2883123Z 2025-05-07T20:32:52.2883232Z moe/activation_test.py:117: 2025-05-07T20:32:52.2883550Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.2883987Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.2884310Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.2885065Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
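The remaining failures are all the same Triton CompilationError: both _kernel_quantize_fp8_row and _fbgemm_silu_mul_quant request the fp8e4nv dtype, and the backend reports that only fp8e4b15 and fp8e5 exist on this architecture. That reads as a hardware-capability gap rather than a code bug; to our knowledge Triton only compiles fp8e4nv for compute capability 8.9 and newer, so a guard along these lines would skip rather than fail on this runner. The (8, 9) threshold and the class name below are our assumptions, not stated in the log:

    # Sketch: skip fp8e4nv kernels on GPUs that cannot compile them.
    # The (8, 9) capability threshold is an assumption, not from the log.
    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")
    class ActivationFp8Tests(unittest.TestCase):
        ...
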
2025-05-07T20:32:52.2885804Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.2886379Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.2887100Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.2887816Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.2888382Z kernel = self.compile( 2025-05-07T20:32:52.2888953Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.2889642Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.2890070Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.2890309Z 2025-05-07T20:32:52.2890530Z self = 2025-05-07T20:32:52.2891662Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.2893106Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1295e4a8e0>} 2025-05-07T20:32:52.2894657Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.2895759Z context = 2025-05-07T20:32:52.2896061Z 2025-05-07T20:32:52.2896242Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.2896790Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.2897290Z module_map=module_map) 2025-05-07T20:32:52.2897682Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.2898069Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.2898345Z E ^ 2025-05-07T20:32:52.2898844Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.2899329Z 2025-05-07T20:32:52.2899782Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.2900332Z 2025-05-07T20:32:52.2900444Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.2900895Z self=, 2025-05-07T20:32:52.2901335Z T=1, 2025-05-07T20:32:52.2901539Z D=7168, 2025-05-07T20:32:52.2901741Z scale_ub=None, 2025-05-07T20:32:52.2901978Z contiguous=True, 2025-05-07T20:32:52.2902226Z compiled=True, 2025-05-07T20:32:52.2902442Z ) 2025-05-07T20:32:52.2902793Z self = 2025-05-07T20:32:52.2903317Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:52.2903591Z 2025-05-07T20:32:52.2903674Z @given( 2025-05-07T20:32:52.2903920Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.2904349Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.2904673Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.2905023Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.2905379Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.2905765Z ) 2025-05-07T20:32:52.2906134Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.2906611Z def test_silu_mul_quant( 2025-05-07T20:32:52.2906880Z self, 2025-05-07T20:32:52.2907089Z T: int, 2025-05-07T20:32:52.2907298Z D: int, 2025-05-07T20:32:52.2907541Z scale_ub: Optional[float], 2025-05-07T20:32:52.2907833Z contiguous: bool, 2025-05-07T20:32:52.2908093Z compiled: bool, 2025-05-07T20:32:52.2908329Z ) -> None: 2025-05-07T20:32:52.2908550Z torch.manual_seed(2025) 2025-05-07T20:32:52.2908803Z 2025-05-07T20:32:52.2909103Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.2909461Z 2025-05-07T20:32:52.2909668Z x_sign = torch.sign(x) 2025-05-07T20:32:52.2909974Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.2910296Z x = x_sign * x_clamp 2025-05-07T20:32:52.2910563Z x0 = x[:, :D] 2025-05-07T20:32:52.2910803Z x1 = x[:, D:] 2025-05-07T20:32:52.2911028Z 2025-05-07T20:32:52.2911221Z if contiguous: 2025-05-07T20:32:52.2911473Z x0 = x0.contiguous() 2025-05-07T20:32:52.2911749Z x1 = x1.contiguous() 2025-05-07T20:32:52.2911993Z 2025-05-07T20:32:52.2912199Z if scale_ub is not None: 2025-05-07T20:32:52.2912488Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.2912835Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.2913166Z ) 2025-05-07T20:32:52.2913369Z else: 2025-05-07T20:32:52.2913580Z scale_ub_tensor = None 2025-05-07T20:32:52.2913869Z 2025-05-07T20:32:52.2914142Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.2914466Z op = silu_mul_quant 2025-05-07T20:32:52.2914730Z if compiled: 2025-05-07T20:32:52.2914995Z op = torch.compile(op) 2025-05-07T20:32:52.2915306Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.2915599Z 2025-05-07T20:32:52.2915813Z y_fp8, y_scale = fn() 2025-05-07T20:32:52.2916104Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:52.2916413Z 2025-05-07T20:32:52.2916661Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.2917014Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:52.2917315Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:52.2917644Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:52.2918021Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:52.2918341Z 2025-05-07T20:32:52.2918561Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:52.2918766Z 2025-05-07T20:32:52.2918880Z moe/activation_test.py:126: 2025-05-07T20:32:52.2919184Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.2919536Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:52.2919887Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:52.2920727Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:52.2921524Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:52.2922108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.2922838Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.2923574Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:52.2924424Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:52.2925210Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:52.2926434Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:52.2927077Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:52.2927643Z fn() 2025-05-07T20:32:52.2928188Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:52.2928815Z self.fn.run( 2025-05-07T20:32:52.2929310Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.2929879Z kernel = self.compile( 2025-05-07T20:32:52.2930459Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.2931155Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.2931569Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.2931823Z 2025-05-07T20:32:52.2932039Z self = 2025-05-07T20:32:52.2933187Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.2934781Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f129587c860>} 2025-05-07T20:32:52.2936202Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.2937307Z context = 2025-05-07T20:32:52.2937622Z 2025-05-07T20:32:52.2937799Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.2938359Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.2938846Z module_map=module_map) 2025-05-07T20:32:52.2939237Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.2939614Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:52.2939893Z E ^ 2025-05-07T20:32:52.2940386Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.2940869Z 2025-05-07T20:32:52.2941317Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.2941864Z 2025-05-07T20:32:52.2941980Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.2942412Z self=, 2025-05-07T20:32:52.2942846Z T=4096, 2025-05-07T20:32:52.2943054Z D=5120, 2025-05-07T20:32:52.2943255Z scale_ub=None, 2025-05-07T20:32:52.2943493Z contiguous=False, 2025-05-07T20:32:52.2943732Z compiled=False, 2025-05-07T20:32:52.2943942Z ) 2025-05-07T20:32:52.2944284Z self = 2025-05-07T20:32:52.2944813Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:52.2945103Z 2025-05-07T20:32:52.2945193Z @given( 2025-05-07T20:32:52.2945435Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.2945768Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.2946097Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.2946644Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.2946994Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.2947298Z ) 2025-05-07T20:32:52.2947658Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.2948902Z def test_silu_mul_quant( 2025-05-07T20:32:52.2949167Z self, 2025-05-07T20:32:52.2949380Z T: int, 2025-05-07T20:32:52.2949585Z D: int, 2025-05-07T20:32:52.2949819Z scale_ub: Optional[float], 2025-05-07T20:32:52.2950108Z contiguous: bool, 2025-05-07T20:32:52.2950357Z compiled: bool, 2025-05-07T20:32:52.2950596Z ) -> None: 2025-05-07T20:32:52.2950820Z torch.manual_seed(2025) 2025-05-07T20:32:52.2951069Z 2025-05-07T20:32:52.2951354Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.2951717Z 2025-05-07T20:32:52.2951912Z x_sign = torch.sign(x) 2025-05-07T20:32:52.2952219Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.2952539Z x = x_sign * x_clamp 2025-05-07T20:32:52.2952779Z x0 = x[:, :D] 2025-05-07T20:32:52.2953002Z x1 = x[:, D:] 2025-05-07T20:32:52.2953223Z 2025-05-07T20:32:52.2953411Z if contiguous: 2025-05-07T20:32:52.2953662Z x0 = x0.contiguous() 2025-05-07T20:32:52.2953927Z x1 = x1.contiguous() 2025-05-07T20:32:52.2954177Z 2025-05-07T20:32:52.2954365Z if scale_ub is not None: 2025-05-07T20:32:52.2954646Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.2954989Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.2955296Z ) 2025-05-07T20:32:52.2955501Z else: 2025-05-07T20:32:52.2955722Z scale_ub_tensor = None 2025-05-07T20:32:52.2955982Z 2025-05-07T20:32:52.2956222Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.2956543Z op = silu_mul_quant 2025-05-07T20:32:52.2956800Z if compiled: 2025-05-07T20:32:52.2957054Z op = torch.compile(op) 2025-05-07T20:32:52.2957359Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.2957634Z 2025-05-07T20:32:52.2959283Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.2959456Z 2025-05-07T20:32:52.2959562Z moe/activation_test.py:117: 2025-05-07T20:32:52.2959863Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.2960203Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.2960495Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.2961214Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:52.2961942Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.2962507Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.2963238Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.2963949Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.2964514Z kernel = self.compile( 2025-05-07T20:32:52.2965097Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.2965798Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.2966211Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.2966461Z 2025-05-07T20:32:52.2966673Z self = 2025-05-07T20:32:52.2967812Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.2969342Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f127b0a8180>} 2025-05-07T20:32:52.2970926Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.2972128Z context = 2025-05-07T20:32:52.2972448Z 2025-05-07T20:32:52.2972626Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.2973191Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.2973714Z module_map=module_map) 2025-05-07T20:32:52.2974095Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.2974582Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.2974875Z E ^ 2025-05-07T20:32:52.2975365Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.2975849Z 2025-05-07T20:32:52.2976296Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.2976862Z 2025-05-07T20:32:52.2976970Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.2977408Z self=, 2025-05-07T20:32:52.2977831Z T=4096, 2025-05-07T20:32:52.2978039Z D=7168, 2025-05-07T20:32:52.2978251Z scale_ub=None, 2025-05-07T20:32:52.2978475Z contiguous=False, 2025-05-07T20:32:52.2978721Z compiled=False, 2025-05-07T20:32:52.2978944Z ) 2025-05-07T20:32:52.2979271Z self = 2025-05-07T20:32:52.2979797Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:52.2980084Z 2025-05-07T20:32:52.2980175Z @given( 2025-05-07T20:32:52.2980467Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.2980894Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.2981220Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.2981565Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.2981896Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.2982195Z ) 2025-05-07T20:32:52.2982549Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.2982998Z def test_silu_mul_quant( 2025-05-07T20:32:52.2983246Z self, 2025-05-07T20:32:52.2983448Z T: int, 2025-05-07T20:32:52.2983641Z D: int, 2025-05-07T20:32:52.2983868Z scale_ub: Optional[float], 2025-05-07T20:32:52.2984154Z contiguous: bool, 2025-05-07T20:32:52.2984429Z compiled: bool, 2025-05-07T20:32:52.2984679Z ) -> None: 2025-05-07T20:32:52.2984899Z torch.manual_seed(2025) 2025-05-07T20:32:52.2985141Z 2025-05-07T20:32:52.2985418Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.2985778Z 2025-05-07T20:32:52.2985989Z x_sign = torch.sign(x) 2025-05-07T20:32:52.2986278Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.2986602Z x = x_sign * x_clamp 2025-05-07T20:32:52.2986850Z x0 = x[:, :D] 2025-05-07T20:32:52.2987067Z x1 = x[:, D:] 2025-05-07T20:32:52.2987283Z 2025-05-07T20:32:52.2987475Z if contiguous: 2025-05-07T20:32:52.2987707Z x0 = x0.contiguous() 2025-05-07T20:32:52.2987977Z x1 = x1.contiguous() 2025-05-07T20:32:52.2988235Z 2025-05-07T20:32:52.2988427Z if scale_ub is not None: 2025-05-07T20:32:52.2988708Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.2989053Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.2989465Z ) 2025-05-07T20:32:52.2989675Z else: 2025-05-07T20:32:52.2989904Z scale_ub_tensor = None 2025-05-07T20:32:52.2990163Z 2025-05-07T20:32:52.2990411Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.2990836Z op = silu_mul_quant 2025-05-07T20:32:52.2991112Z if compiled: 2025-05-07T20:32:52.2991375Z op = torch.compile(op) 2025-05-07T20:32:52.2991695Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.2992001Z 2025-05-07T20:32:52.2992208Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.2992394Z 2025-05-07T20:32:52.2992498Z moe/activation_test.py:117: 2025-05-07T20:32:52.2992821Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.2993167Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.2993478Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.2994276Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:52.2995023Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.2995593Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.2996336Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.2997054Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.2997619Z kernel = self.compile( 2025-05-07T20:32:52.2998201Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.2998913Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.2999332Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.2999572Z 2025-05-07T20:32:52.2999792Z self = 2025-05-07T20:32:52.3000924Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.3002360Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f127b0a9080>} 2025-05-07T20:32:52.3003775Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.3004881Z context = 2025-05-07T20:32:52.3005185Z 2025-05-07T20:32:52.3005360Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.3005914Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.3006410Z module_map=module_map) 2025-05-07T20:32:52.3006578Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.3006687Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.3006777Z E ^ 2025-05-07T20:32:52.3007153Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.3007158Z 2025-05-07T20:32:52.3007607Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.3007611Z 2025-05-07T20:32:52.3007719Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.3007952Z self=, 2025-05-07T20:32:52.3008046Z T=128, 2025-05-07T20:32:52.3008130Z D=7168, 2025-05-07T20:32:52.3008336Z scale_ub=None, 2025-05-07T20:32:52.3008440Z contiguous=False, 2025-05-07T20:32:52.3008528Z compiled=True, 2025-05-07T20:32:52.3008619Z ) 2025-05-07T20:32:52.3008849Z self = 2025-05-07T20:32:52.3009109Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:52.3009113Z 2025-05-07T20:32:52.3009201Z @given( 2025-05-07T20:32:52.3009325Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.3009431Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.3009558Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.3009682Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.3009799Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.3009888Z ) 2025-05-07T20:32:52.3010143Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.3010252Z def test_silu_mul_quant( 2025-05-07T20:32:52.3010340Z self, 2025-05-07T20:32:52.3010422Z T: int, 2025-05-07T20:32:52.3010518Z D: int, 2025-05-07T20:32:52.3010627Z scale_ub: Optional[float], 2025-05-07T20:32:52.3010725Z contiguous: bool, 2025-05-07T20:32:52.3010839Z compiled: bool, 2025-05-07T20:32:52.3010926Z ) -> None: 2025-05-07T20:32:52.3011030Z torch.manual_seed(2025) 2025-05-07T20:32:52.3011117Z 2025-05-07T20:32:52.3011296Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.3011386Z 2025-05-07T20:32:52.3011485Z x_sign = torch.sign(x) 2025-05-07T20:32:52.3011614Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.3011714Z x = x_sign * x_clamp 2025-05-07T20:32:52.3029379Z x0 = x[:, :D] 2025-05-07T20:32:52.3029499Z x1 = x[:, D:] 2025-05-07T20:32:52.3029583Z 2025-05-07T20:32:52.3029676Z if contiguous: 2025-05-07T20:32:52.3029783Z x0 = x0.contiguous() 2025-05-07T20:32:52.3029887Z x1 = x1.contiguous() 2025-05-07T20:32:52.3029965Z 2025-05-07T20:32:52.3030059Z if scale_ub is not None: 2025-05-07T20:32:52.3030180Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.3030327Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.3030406Z ) 2025-05-07T20:32:52.3030501Z else: 2025-05-07T20:32:52.3030600Z scale_ub_tensor = None 2025-05-07T20:32:52.3030675Z 2025-05-07T20:32:52.3030823Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.3030917Z op = silu_mul_quant 2025-05-07T20:32:52.3031011Z if compiled: 2025-05-07T20:32:52.3031116Z op = torch.compile(op) 2025-05-07T20:32:52.3031225Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.3031309Z 2025-05-07T20:32:52.3031404Z y_fp8, y_scale = fn() 2025-05-07T20:32:52.3031534Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:52.3031619Z 2025-05-07T20:32:52.3031760Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.3031865Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:52.3031975Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:52.3032105Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:52.3032259Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:52.3032334Z 2025-05-07T20:32:52.3032437Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:52.3032443Z 2025-05-07T20:32:52.3032555Z moe/activation_test.py:126: 2025-05-07T20:32:52.3032690Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.3032801Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:52.3032945Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:52.3033721Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:52.3033837Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:52.3034261Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.3034614Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.3035009Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:52.3035274Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:52.3035670Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:52.3035855Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:52.3036220Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:52.3036311Z fn() 2025-05-07T20:32:52.3036732Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:52.3036817Z self.fn.run( 2025-05-07T20:32:52.3037185Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.3037282Z kernel = self.compile( 2025-05-07T20:32:52.3037682Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.3037876Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.3038008Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.3038014Z 2025-05-07T20:32:52.3038235Z self = 2025-05-07T20:32:52.3039050Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.3039576Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f127b0a9f80>} 2025-05-07T20:32:52.3040374Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.3040571Z context = 2025-05-07T20:32:52.3040575Z 2025-05-07T20:32:52.3040755Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.3041030Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.3041155Z module_map=module_map) 2025-05-07T20:32:52.3041326Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.3041430Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:52.3041521Z E ^ 2025-05-07T20:32:52.3041893Z E ValueError("type fp8e4nv not supported in this architecture. 
Trying example: test_silu_mul_quant(
    self=<...>,
    T=128,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
self = <...>
T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False

    (test body identical to the first example above)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <...>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
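Note that the failure is independent of FBGEMM's kernel logic: any Triton kernel that casts to tl.float8e4nv trips the same ValueError at compile time on this architecture, whether launched eagerly or under torch.compile. A minimal repro sketch (hypothetical, not taken from the log):

    import torch
    import triton
    import triton.language as tl


    @triton.jit
    def _cast_to_fp8e4nv(x_ptr, y_ptr, n_elements, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n_elements
        x = tl.load(x_ptr + offs, mask=mask)
        # On sm_86 this cast is what raises
        # ValueError("type fp8e4nv not supported in this architecture. ...")
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)


    x = torch.randn(1024, device="cuda")
    y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
    _cast_to_fp8e4nv[(1,)](x, y, x.numel(), BLOCK=1024)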
(The next six drawn examples fail identically; the repeated test source and Triton compile stacks are omitted.)

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
    -> fails in fn() at moe/activation_test.py:117, compiling _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)
    -> fails in ref_fn() at moe/activation_test.py:126, compiling _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True)
    -> fails in ref_fn() at moe/activation_test.py:126, compiling _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
    -> fails in ref_fn() at moe/activation_test.py:126, compiling _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
    -> fails in ref_fn() at moe/activation_test.py:126, compiling _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
    -> fails in ref_fn() at moe/activation_test.py:126, compiling _kernel_quantize_fp8_row

E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
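For reference, the operation under test is a SiLU gate followed by row-wise FP8 quantization. A rough eager-mode equivalent is sketched below; the scaling convention (abs-max per row against the E4M3 maximum of 448) is an assumption, and triton_quantize_fp8_row's exact behavior may differ:

    from typing import Optional, Tuple

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0


    def silu_mul_quant_eager(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # y = silu(x0) * x1, computed in fp32 exactly like the test's ref_fn.
        y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
        # One scale per row, optionally clamped by scale_ub.
        row_max = y.abs().amax(dim=1)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = row_max.clamp(min=1e-12) / FP8_MAX
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale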
Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
self = <...>
T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True

    (test body identical to the first example above)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
    (same Triton compile stack as above)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
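The error message itself names the only formats this GPU can use. A hedged sketch of capability-based dtype selection follows; this illustrates a possible fallback, not what fbgemm_gpu does (a kernel that switches formats must also rescale, since E5M2 trades mantissa bits for range):

    import torch


    def pick_fp8_dtype() -> torch.dtype:
        # Per the error above, sm_86 supports fp8e5 (E5M2) and fp8e4b15,
        # but not fp8e4nv (E4M3); E4M3 needs sm_89 or newer.
        if torch.cuda.get_device_capability() >= (8, 9):
            return torch.float8_e4m3fn
        return torch.float8_e5m2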
Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self = <...>
T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True

    (test body identical to the first example above)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
    (same Triton compile stack as above)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:52.3209312Z 
2025-05-07T20:32:52.3209421Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:52.3209653Z     self=,
2025-05-07T20:32:52.3209741Z     T=1,
2025-05-07T20:32:52.3209822Z     D=5120,
2025-05-07T20:32:52.3209912Z     scale_ub=None,
2025-05-07T20:32:52.3210004Z     contiguous=True,
2025-05-07T20:32:52.3210090Z     compiled=False,
2025-05-07T20:32:52.3210179Z )
2025-05-07T20:32:52.3210410Z self = 
2025-05-07T20:32:52.3210587Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False
[test_silu_mul_quant source identical to the listing above; this example fails earlier, at the first kernel launch inside fn():]
2025-05-07T20:32:52.3215698Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:52.3215703Z 
2025-05-07T20:32:52.3215803Z moe/activation_test.py:117: 
2025-05-07T20:32:52.3215933Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:52.3216035Z moe/activation_test.py:115: in fn
2025-05-07T20:32:52.3216140Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:52.3216668Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:52.3216772Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:52.3217146Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 
2025-05-07T20:32:52.3217374Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:52.3217734Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:52.3217830Z     kernel = self.compile(
2025-05-07T20:32:52.3218228Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:52.3218411Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:52.3218545Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:52.3218550Z 
2025-05-07T20:32:52.3218760Z self = 
2025-05-07T20:32:52.3219567Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:52.3220081Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f127a001bc0>}
2025-05-07T20:32:52.3220878Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:52.3221069Z context = 
2025-05-07T20:32:52.3221079Z 
2025-05-07T20:32:52.3221254Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:52.3221523Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:52.3221632Z                            module_map=module_map)
2025-05-07T20:32:52.3221795Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:52.3221896Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:52.3221978Z E       ^
2025-05-07T20:32:52.3222346Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:52.3222351Z 
2025-05-07T20:32:52.3222881Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
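For reference, what the failing launch is trying to compute: silu_mul_quant fuses y = silu(x0) * x1 with row-wise FP8 quantization, returning the FP8 payload plus one float32 scale per row, so that y_fp8.to(torch.float32) * y_scale[:, None] recovers y (as the test's dequantization line shows). A minimal unfused PyTorch sketch of the same math (an illustration, not FBGEMM's kernel), assuming torch.float8_e4m3fn is available (PyTorch >= 2.1) and that scale_ub caps the per-row max:

    from typing import Optional, Tuple

    import torch

    FP8_E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3fn


    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        y = x0.float() * torch.sigmoid(x0.float()) * x1.float()  # silu(x0) * x1
        row_max = y.abs().amax(dim=1)                            # per-row amax
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)           # cap the amax
        scale = (row_max / FP8_E4M3_MAX).clamp(min=1e-12)        # dequant scale
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

Called with the x0, x1, and scale_ub_tensor built by the test above, this returns the (y_fp8, y_scale) pair the test dequantizes and compares.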
2025-05-07T20:32:52.3222999Z Trying example: test_silu_mul_quant(
    self=, T=128, D=5120, scale_ub=None, contiguous=False, compiled=True,
)
[source listing identical to the first example above. Because compiled=True, the traceback carries one extra frame from torch.compile before reaching the same kernel launch:]
moe/activation_test.py:117: 
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
[remaining frames identical to the previous example:]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
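The "at 1:0" in the CompilationError points at the kernel's def line because the error is raised while the Python AST is lowered to Triton IR (src.make_ir -> ast_to_ttir), before any GPU code exists. The failure does not need FBGEMM at all; a minimal sketch (a hypothetical repro, using the same Triton API the traceback shows) that raises the identical ValueError on a pre-SM-8.9 GPU:

    import torch
    import triton
    import triton.language as tl


    @triton.jit
    def _cast_to_fp8e4nv(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(x_ptr + offs, mask=mask)
        # On SM 8.6 this cast is what triggers:
        #   ValueError("type fp8e4nv not supported in this architecture. ...")
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)


    n = 1024
    x = torch.randn(n, device="cuda", dtype=torch.bfloat16)
    y = torch.empty(n, device="cuda", dtype=torch.float8_e4m3fn)
    _cast_to_fp8e4nv[(triton.cdiv(n, 1024),)](x, y, n, BLOCK=1024)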
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
[each of these examples fails at the fn() call with the same CompilationError in _fbgemm_silu_mul_quant shown above; the compiled=True runs add the torch/_dynamo/eval_frame.py:678 frame, nothing else differs]
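The parameter tuples above come from Hypothesis walking the @given grid (Verbosity.verbose prints each "Trying example"). If one of these grid points needed to be replayed deterministically on every run, say while bisecting this failure, Hypothesis's @example decorator could be stacked onto the test. A sketch (max_examples=_MAX_SAMPLES omitted here, since _MAX_SAMPLES is local to the test module):

    from hypothesis import Verbosity, example, given, settings, strategies as st

    @example(T=128, D=5120, scale_ub=None, contiguous=False, compiled=True)
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, deadline=None)
    def test_silu_mul_quant(self, T, D, scale_ub, contiguous, compiled): ...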
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)
[for this example the fn() call succeeds and the failure moves to the reference path, exactly as in the first full listing above: ref_fn() -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row, failing inside the autotuner while benchmarking candidate configs:]
moe/activation_test.py:126: 
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
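Note where the two failure modes diverge: the fn() path fails directly in jit.py's run -> compile, while the ref_fn() path goes through autotuner.py first (run -> _bench -> do_bench), because _kernel_quantize_fp8_row is autotuned and each candidate config is compiled lazily the first time it is benchmarked. Either way the architecture check fires at first launch, not at import. One way such a suite could surface this as a skip rather than a Hypothesis-reported error; a sketch, where call_or_skip is a hypothetical helper built on the CompilationError type the traceback itself names:

    import pytest
    from triton.compiler.errors import CompilationError


    def call_or_skip(fn, *args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except CompilationError as e:
            pytest.skip(f"Triton fp8 kernel unsupported on this GPU: {e}")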
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.3323642Z 2025-05-07T20:32:52.3324080Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.3324084Z 2025-05-07T20:32:52.3324185Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.3324413Z self=, 2025-05-07T20:32:52.3324488Z T=1, 2025-05-07T20:32:52.3324559Z D=5120, 2025-05-07T20:32:52.3324645Z scale_ub=1200.0, 2025-05-07T20:32:52.3324727Z contiguous=False, 2025-05-07T20:32:52.3324808Z compiled=True, 2025-05-07T20:32:52.3324889Z ) 2025-05-07T20:32:52.3325112Z self = 2025-05-07T20:32:52.3325280Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:52.3325289Z 2025-05-07T20:32:52.3325364Z @given( 2025-05-07T20:32:52.3325738Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.3325841Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.3325952Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.3326065Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.3326175Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.3326244Z ) 2025-05-07T20:32:52.3326493Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.3326587Z def test_silu_mul_quant( 2025-05-07T20:32:52.3326662Z self, 2025-05-07T20:32:52.3326741Z T: int, 2025-05-07T20:32:52.3326966Z D: int, 2025-05-07T20:32:52.3327064Z scale_ub: Optional[float], 2025-05-07T20:32:52.3327152Z contiguous: bool, 2025-05-07T20:32:52.3327240Z compiled: bool, 2025-05-07T20:32:52.3327316Z ) -> None: 2025-05-07T20:32:52.3327542Z torch.manual_seed(2025) 2025-05-07T20:32:52.3327612Z 2025-05-07T20:32:52.3327781Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.3327855Z 2025-05-07T20:32:52.3327947Z x_sign = torch.sign(x) 2025-05-07T20:32:52.3328069Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.3328159Z x = x_sign * x_clamp 2025-05-07T20:32:52.3328236Z x0 = x[:, :D] 2025-05-07T20:32:52.3328316Z x1 = x[:, D:] 2025-05-07T20:32:52.3328385Z 2025-05-07T20:32:52.3328461Z if contiguous: 2025-05-07T20:32:52.3328548Z x0 = x0.contiguous() 2025-05-07T20:32:52.3328639Z x1 = x1.contiguous() 2025-05-07T20:32:52.3328711Z 2025-05-07T20:32:52.3328803Z if scale_ub is not None: 2025-05-07T20:32:52.3328907Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.3329036Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.3329117Z ) 2025-05-07T20:32:52.3329199Z else: 2025-05-07T20:32:52.3329288Z scale_ub_tensor = None 2025-05-07T20:32:52.3329361Z 2025-05-07T20:32:52.3329488Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.3329573Z op = silu_mul_quant 2025-05-07T20:32:52.3329659Z if compiled: 2025-05-07T20:32:52.3329754Z op = torch.compile(op) 2025-05-07T20:32:52.3329855Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.3329926Z 2025-05-07T20:32:52.3330010Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.3330015Z 2025-05-07T20:32:52.3330106Z moe/activation_test.py:117: 2025-05-07T20:32:52.3330243Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.3330339Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.3330440Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.3330821Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:52.3330915Z return fn(*args, **kwargs) 
2025-05-07T20:32:52.3331443Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:52.3331537Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.3331908Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.3332139Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.3332494Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.3332588Z kernel = self.compile( 2025-05-07T20:32:52.3332993Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.3333169Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.3333307Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.3333311Z 2025-05-07T20:32:52.3333514Z self = 2025-05-07T20:32:52.3334352Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.3335008Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1279459f80>} 2025-05-07T20:32:52.3335890Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.3336082Z context = 2025-05-07T20:32:52.3336163Z 2025-05-07T20:32:52.3336328Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.3336606Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.3336711Z module_map=module_map) 2025-05-07T20:32:52.3336871Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.3336972Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.3337048Z E ^ 2025-05-07T20:32:52.3337423Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
self = <...>
T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <...>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
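This failure is an architecture mismatch rather than a problem with the test inputs: fp8e4nv is Triton's name for the float8_e4m3fn format, which (as of this Triton version) NVIDIA GPUs expose only from compute capability 8.9 (Ada/Hopper) upward; older parts offer just the fp8e4b15 and fp8e5 variants named in the ValueError. Below is a minimal sketch of a capability guard that would skip these examples on unsupported hardware; the sm_89 cutoff is an assumption about Triton's support matrix, and the class name is hypothetical (the real suite in moe/activation_test.py carries no such guard):

    import unittest

    import torch


    def supports_fp8e4nv() -> bool:
        # Assumption: Triton's fp8e4nv (float8_e4m3fn) needs an NVIDIA GPU
        # with compute capability >= 8.9; anything older only exposes the
        # fp8e4b15/fp8e5 variants reported by the ValueError above.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)


    @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires sm_89 or newer")
    class SiluMulQuantGuardedTest(unittest.TestCase):  # hypothetical name
        def test_placeholder(self) -> None:
            self.assertTrue(supports_fp8e4nv())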
Hypothesis went on to draw eleven more examples, and every one failed at the same kernel-compilation step with the identical CompilationError (ValueError: type fp8e4nv not supported in this architecture):

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)

For the compiled=True examples the traceback additionally passes through torch/_dynamo/eval_frame.py:678 (in _fn: return fn(*args, **kwargs)) before reaching activation.py:80; the failing frame and error are otherwise identical to the one shown above.
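For context on what the failing call would compute if the kernel compiled: the test hands silu_mul_quant two bfloat16 halves x0 and x1 of shape [T, D] plus an optional float32 scale_ub tensor, and unpacks a (y_fp8, y_scale) pair. The kernel body is not visible in this log, so the eager-mode sketch below is an assumption about that contract (a SiLU-gated multiply followed by row-wise FP8 quantization, with scale_ub capping the amax), not fbgemm's actual algorithm:

    from typing import Optional, Tuple

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn


    def silu_mul_quant_reference(
        x0: torch.Tensor,  # [T, D], bfloat16
        x1: torch.Tensor,  # [T, D], bfloat16
        scale_ub: Optional[torch.Tensor] = None,  # [1], float32
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # SiLU-gated multiply, computed in fp32 for accuracy.
        y = torch.nn.functional.silu(x0.float()) * x1.float()
        # Row-wise amax, optionally capped by scale_ub (assumed semantics).
        row_max = y.abs().amax(dim=1, keepdim=True)
        if scale_ub is not None:
            row_max = torch.clamp(row_max, max=scale_ub)
        y_scale = (row_max / FP8_MAX).clamp(min=1e-12)  # avoid divide-by-zero
        y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)
        return y_fp8, y_scale

On the test's inputs this yields a [T, D] float8 tensor and a [T, 1] scale; whether the real kernel scales per row or per tensor cannot be determined from the log.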
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.3459391Z 2025-05-07T20:32:52.3459834Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.3459839Z 2025-05-07T20:32:52.3459941Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.3460174Z self=, 2025-05-07T20:32:52.3460254Z T=2048, 2025-05-07T20:32:52.3460330Z D=7168, 2025-05-07T20:32:52.3460412Z scale_ub=1200.0, 2025-05-07T20:32:52.3460500Z contiguous=False, 2025-05-07T20:32:52.3460584Z compiled=False, 2025-05-07T20:32:52.3460655Z ) 2025-05-07T20:32:52.3460880Z self = 2025-05-07T20:32:52.3461057Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:52.3461062Z 2025-05-07T20:32:52.3461144Z @given( 2025-05-07T20:32:52.3461266Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.3461452Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.3461574Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.3461690Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.3461804Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.3461957Z ) 2025-05-07T20:32:52.3462211Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.3462311Z def test_silu_mul_quant( 2025-05-07T20:32:52.3462389Z self, 2025-05-07T20:32:52.3462468Z T: int, 2025-05-07T20:32:52.3462552Z D: int, 2025-05-07T20:32:52.3462653Z scale_ub: Optional[float], 2025-05-07T20:32:52.3462746Z contiguous: bool, 2025-05-07T20:32:52.3462837Z compiled: bool, 2025-05-07T20:32:52.3462915Z ) -> None: 2025-05-07T20:32:52.3463010Z torch.manual_seed(2025) 2025-05-07T20:32:52.3463088Z 2025-05-07T20:32:52.3463262Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.3463343Z 2025-05-07T20:32:52.3463440Z x_sign = torch.sign(x) 2025-05-07T20:32:52.3463567Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.3463660Z x = x_sign * x_clamp 2025-05-07T20:32:52.3463749Z x0 = x[:, :D] 2025-05-07T20:32:52.3463834Z x1 = x[:, D:] 2025-05-07T20:32:52.3463911Z 2025-05-07T20:32:52.3463996Z if contiguous: 2025-05-07T20:32:52.3464091Z x0 = x0.contiguous() 2025-05-07T20:32:52.3464188Z x1 = x1.contiguous() 2025-05-07T20:32:52.3464262Z 2025-05-07T20:32:52.3464355Z if scale_ub is not None: 2025-05-07T20:32:52.3464465Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.3464600Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.3464680Z ) 2025-05-07T20:32:52.3464766Z else: 2025-05-07T20:32:52.3464859Z scale_ub_tensor = None 2025-05-07T20:32:52.3464935Z 2025-05-07T20:32:52.3465068Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.3465155Z op = silu_mul_quant 2025-05-07T20:32:52.3465239Z if compiled: 2025-05-07T20:32:52.3465333Z op = torch.compile(op) 2025-05-07T20:32:52.3465437Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.3465518Z 2025-05-07T20:32:52.3465604Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.3465609Z 2025-05-07T20:32:52.3465701Z moe/activation_test.py:117: 2025-05-07T20:32:52.3465835Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.3465931Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.3466028Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.3466558Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:52.3466652Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.3467036Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.3467264Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.3467616Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.3467714Z kernel = self.compile( 2025-05-07T20:32:52.3468111Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.3468283Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.3468412Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.3468417Z 2025-05-07T20:32:52.3468622Z self = 2025-05-07T20:32:52.3469517Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.3470031Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f129592a340>} 2025-05-07T20:32:52.3470899Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.3471093Z context = 2025-05-07T20:32:52.3471098Z 2025-05-07T20:32:52.3471266Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.3471541Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.3471648Z module_map=module_map) 2025-05-07T20:32:52.3471818Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.3471917Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.3471998Z E ^ 2025-05-07T20:32:52.3472365Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.3472374Z 2025-05-07T20:32:52.3472805Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.3472809Z 2025-05-07T20:32:52.3472910Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.3473140Z self=, 2025-05-07T20:32:52.3473216Z T=1, 2025-05-07T20:32:52.3473303Z D=7168, 2025-05-07T20:32:52.3473385Z scale_ub=None, 2025-05-07T20:32:52.3473468Z contiguous=True, 2025-05-07T20:32:52.3473554Z compiled=False, 2025-05-07T20:32:52.3473628Z ) 2025-05-07T20:32:52.3473856Z self = 2025-05-07T20:32:52.3474027Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:52.3474031Z 2025-05-07T20:32:52.3474109Z @given( 2025-05-07T20:32:52.3474238Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.3474360Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.3474499Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.3474621Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.3474735Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.3474811Z ) 2025-05-07T20:32:52.3475064Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.3475159Z def test_silu_mul_quant( 2025-05-07T20:32:52.3475236Z self, 2025-05-07T20:32:52.3475316Z T: int, 2025-05-07T20:32:52.3475393Z D: int, 2025-05-07T20:32:52.3475489Z scale_ub: Optional[float], 2025-05-07T20:32:52.3475585Z contiguous: bool, 2025-05-07T20:32:52.3475671Z compiled: bool, 2025-05-07T20:32:52.3475746Z ) -> None: 2025-05-07T20:32:52.3475836Z torch.manual_seed(2025) 2025-05-07T20:32:52.3475905Z 2025-05-07T20:32:52.3476076Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.3476150Z 2025-05-07T20:32:52.3476237Z x_sign = torch.sign(x) 2025-05-07T20:32:52.3476360Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.3476442Z x = x_sign * x_clamp 2025-05-07T20:32:52.3476519Z x0 = x[:, :D] 2025-05-07T20:32:52.3476595Z x1 = x[:, D:] 2025-05-07T20:32:52.3476662Z 2025-05-07T20:32:52.3476742Z if contiguous: 2025-05-07T20:32:52.3476835Z x0 = x0.contiguous() 2025-05-07T20:32:52.3476922Z x1 = x1.contiguous() 2025-05-07T20:32:52.3476994Z 2025-05-07T20:32:52.3477082Z if scale_ub is not None: 2025-05-07T20:32:52.3477266Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.3477401Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.3477470Z ) 2025-05-07T20:32:52.3477545Z else: 2025-05-07T20:32:52.3477637Z scale_ub_tensor = None 2025-05-07T20:32:52.3477785Z 2025-05-07T20:32:52.3477911Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.3478002Z op = silu_mul_quant 2025-05-07T20:32:52.3478082Z if compiled: 2025-05-07T20:32:52.3478175Z op = torch.compile(op) 2025-05-07T20:32:52.3478281Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.3478352Z 2025-05-07T20:32:52.3478442Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.3478449Z 2025-05-07T20:32:52.3478540Z moe/activation_test.py:117: 2025-05-07T20:32:52.3478674Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.3478773Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.3478873Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.3479394Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:52.3479493Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.3479869Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.3480093Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.3480453Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.3480543Z kernel = self.compile( 2025-05-07T20:32:52.3480944Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.3481118Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.3481247Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.3481251Z 2025-05-07T20:32:52.3481455Z self = 2025-05-07T20:32:52.3482262Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.3482784Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1295a031a0>} 2025-05-07T20:32:52.3483575Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.3483768Z context = 2025-05-07T20:32:52.3483773Z 2025-05-07T20:32:52.3483940Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.3484210Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.3484319Z module_map=module_map) 2025-05-07T20:32:52.3484509Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.3484623Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.3484700Z E ^ 2025-05-07T20:32:52.3485068Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=True,
)
self =
T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f129587da80>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
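The failure is environmental rather than numerical: fp8e4nv is Triton's name for the FP8 E4M3 format (torch.float8_e4m3fn), which Triton can only compile for NVIDIA GPUs of compute capability sm_89 or newer (Ada, Hopper). On older parts such as the A10G (sm_86) it advertises exactly the set quoted in the error, ('fp8e4b15', 'fp8e5'). A minimal capability guard, sketched under that assumption (supports_fp8e4nv is a hypothetical helper, not an fbgemm_gpu API):

# Hypothetical guard, assuming Triton's sm_89+ requirement for fp8e4nv.
import torch

def supports_fp8e4nv() -> bool:
    # fp8e4nv (FP8 E4M3) compiles only on sm_89+ (Ada, Hopper); older
    # devices raise the CompilationError seen above at kernel-build time.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

if supports_fp8e4nv():
    pass  # safe to launch the fp8e4nv quantization kernel
else:
    pass  # fall back (e.g. bf16) or skip on this runner

Gating the kernel launch, or the test, on such a check avoids invoking the fp8e4nv path on hardware that can never compile it.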
All remaining Hypothesis examples failed identically: each drawn parameter set reached _fbgemm_silu_mul_quant[grid]( in fbgemm_gpu/experimental/gen_ai/moe/activation.py:80 and raised the same triton.compiler.errors.CompilationError (ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")). The examples tried:

Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=None,   contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=None,   contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=None,   contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True,  compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None,   contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=None,   contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=True,  compiled=True)
Trying example: test_silu_mul_quant(T=128,   D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True,  compiled=True)
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.3650060Z 2025-05-07T20:32:52.3650496Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.3650500Z 2025-05-07T20:32:52.3650602Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.3650836Z self=, 2025-05-07T20:32:52.3650911Z T=16384, 2025-05-07T20:32:52.3650988Z D=5120, 2025-05-07T20:32:52.3651070Z scale_ub=1200.0, 2025-05-07T20:32:52.3651153Z contiguous=True, 2025-05-07T20:32:52.3651241Z compiled=False, 2025-05-07T20:32:52.3651316Z ) 2025-05-07T20:32:52.3651541Z self = 2025-05-07T20:32:52.3651721Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:52.3651726Z 2025-05-07T20:32:52.3651802Z @given( 2025-05-07T20:32:52.3651918Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.3652018Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.3652132Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.3652250Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.3652362Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.3652435Z ) 2025-05-07T20:32:52.3652691Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.3652783Z def test_silu_mul_quant( 2025-05-07T20:32:52.3652858Z self, 2025-05-07T20:32:52.3652937Z T: int, 2025-05-07T20:32:52.3653016Z D: int, 2025-05-07T20:32:52.3653114Z scale_ub: Optional[float], 2025-05-07T20:32:52.3653207Z contiguous: bool, 2025-05-07T20:32:52.3653292Z compiled: bool, 2025-05-07T20:32:52.3653369Z ) -> None: 2025-05-07T20:32:52.3653469Z torch.manual_seed(2025) 2025-05-07T20:32:52.3653545Z 2025-05-07T20:32:52.3653720Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.3653802Z 2025-05-07T20:32:52.3653897Z x_sign = torch.sign(x) 2025-05-07T20:32:52.3654028Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.3654124Z x = x_sign * x_clamp 2025-05-07T20:32:52.3654207Z x0 = x[:, :D] 2025-05-07T20:32:52.3654493Z x1 = x[:, D:] 2025-05-07T20:32:52.3654573Z 2025-05-07T20:32:52.3654661Z if contiguous: 2025-05-07T20:32:52.3654762Z x0 = x0.contiguous() 2025-05-07T20:32:52.3654854Z x1 = x1.contiguous() 2025-05-07T20:32:52.3654929Z 2025-05-07T20:32:52.3655106Z if scale_ub is not None: 2025-05-07T20:32:52.3655214Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.3655352Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.3655432Z ) 2025-05-07T20:32:52.3655510Z else: 2025-05-07T20:32:52.3655610Z scale_ub_tensor = None 2025-05-07T20:32:52.3655684Z 2025-05-07T20:32:52.3655813Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.3655906Z op = silu_mul_quant 2025-05-07T20:32:52.3655990Z if compiled: 2025-05-07T20:32:52.3656088Z op = torch.compile(op) 2025-05-07T20:32:52.3656194Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.3656272Z 2025-05-07T20:32:52.3656367Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.3656371Z 2025-05-07T20:32:52.3656471Z moe/activation_test.py:117: 2025-05-07T20:32:52.3656600Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.3656706Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.3656804Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.3657329Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:52.3657429Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.3657803Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.3658032Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.3658393Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.3658484Z kernel = self.compile( 2025-05-07T20:32:52.3658885Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.3659057Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.3659191Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.3659195Z 2025-05-07T20:32:52.3659402Z self = 2025-05-07T20:32:52.3660209Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.3660723Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f127ad1c720>} 2025-05-07T20:32:52.3661518Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.3661716Z context = 2025-05-07T20:32:52.3661723Z 2025-05-07T20:32:52.3661888Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.3662156Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.3662265Z module_map=module_map) 2025-05-07T20:32:52.3662425Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.3662524Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.3662603Z E ^ 2025-05-07T20:32:52.3662968Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.3663056Z 2025-05-07T20:32:52.3663499Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.3663504Z 2025-05-07T20:32:52.3663609Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.3663942Z self=, 2025-05-07T20:32:52.3664042Z T=1, 2025-05-07T20:32:52.3664132Z D=7168, 2025-05-07T20:32:52.3664212Z scale_ub=1200.0, 2025-05-07T20:32:52.3664301Z contiguous=False, 2025-05-07T20:32:52.3664384Z compiled=False, 2025-05-07T20:32:52.3664456Z ) 2025-05-07T20:32:52.3664677Z self = 2025-05-07T20:32:52.3664845Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:52.3664850Z 2025-05-07T20:32:52.3664928Z @given( 2025-05-07T20:32:52.3665048Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.3665154Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.3665275Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.3665394Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.3665509Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.3665594Z ) 2025-05-07T20:32:52.3665844Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.3665942Z def test_silu_mul_quant( 2025-05-07T20:32:52.3666023Z self, 2025-05-07T20:32:52.3666101Z T: int, 2025-05-07T20:32:52.3666184Z D: int, 2025-05-07T20:32:52.3666284Z scale_ub: Optional[float], 2025-05-07T20:32:52.3666375Z contiguous: bool, 2025-05-07T20:32:52.3666467Z compiled: bool, 2025-05-07T20:32:52.3666544Z ) -> None: 2025-05-07T20:32:52.3666638Z torch.manual_seed(2025) 2025-05-07T20:32:52.3666715Z 2025-05-07T20:32:52.3666887Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.3666963Z 2025-05-07T20:32:52.3667058Z x_sign = torch.sign(x) 2025-05-07T20:32:52.3667179Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.3667268Z x = x_sign * x_clamp 2025-05-07T20:32:52.3667362Z x0 = x[:, :D] 2025-05-07T20:32:52.3667440Z x1 = x[:, D:] 2025-05-07T20:32:52.3667518Z 2025-05-07T20:32:52.3667601Z if contiguous: 2025-05-07T20:32:52.3667692Z x0 = x0.contiguous() 2025-05-07T20:32:52.3667786Z x1 = x1.contiguous() 2025-05-07T20:32:52.3667856Z 2025-05-07T20:32:52.3667945Z if scale_ub is not None: 2025-05-07T20:32:52.3668049Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.3668182Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.3668258Z ) 2025-05-07T20:32:52.3668338Z else: 2025-05-07T20:32:52.3668430Z scale_ub_tensor = None 2025-05-07T20:32:52.3668502Z 2025-05-07T20:32:52.3668640Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.3668730Z op = silu_mul_quant 2025-05-07T20:32:52.3668812Z if compiled: 2025-05-07T20:32:52.3668914Z op = torch.compile(op) 2025-05-07T20:32:52.3669024Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.3669097Z 2025-05-07T20:32:52.3669187Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.3669191Z 2025-05-07T20:32:52.3669286Z moe/activation_test.py:117: 2025-05-07T20:32:52.3669421Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.3669521Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.3669619Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.3670147Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:52.3670242Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.3670714Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.3670948Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.3671308Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.3671505Z kernel = self.compile( 2025-05-07T20:32:52.3671905Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.3672080Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.3672215Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.3672220Z 2025-05-07T20:32:52.3672423Z self = 2025-05-07T20:32:52.3673242Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.3673755Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f127b2f05e0>} 2025-05-07T20:32:52.3674558Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.3674755Z context = 2025-05-07T20:32:52.3674760Z 2025-05-07T20:32:52.3674927Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.3675204Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.3675309Z module_map=module_map) 2025-05-07T20:32:52.3675479Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.3675576Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.3675652Z E ^ 2025-05-07T20:32:52.3676020Z E ValueError("type fp8e4nv not supported in this architecture. 
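For context on what these examples exercise: silu_mul_quant fuses a SiLU gate with FP8 quantization. Below is a minimal eager-mode sketch of the math under test, assuming the op computes silu(x0) * x1 and then scales the result row-wise into torch.float8_e4m3fn (the dtype Triton calls fp8e4nv); the name silu_mul_quant_ref and the exact scaling scheme are illustrative assumptions, not FBGEMM's actual implementation.

import torch
from typing import Optional, Tuple

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Illustrative reference only: gate, multiply, then row-wise FP8 quantize.
    y = torch.nn.functional.silu(x0.float()) * x1.float()
    row_max = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)  # cap the quantization range
    scale = row_max / FP8_MAX
    y_fp8 = (y / scale).to(torch.float8_e4m3fn)
    return y_fp8, scale.squeeze(-1)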
[The following examples fail with the same CompilationError; the identical test body and traceback are elided.]

Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
E       triton.compiler.errors.CompilationError: ... ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
E       triton.compiler.errors.CompilationError: ... (same error)

Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
E       triton.compiler.errors.CompilationError: ... (same error)

Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
E       triton.compiler.errors.CompilationError: ... (same error)

Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
E       triton.compiler.errors.CompilationError: ... (same error)

Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
E       triton.compiler.errors.CompilationError: ... (same error)
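The root cause is the GPU generation, not the drawn parameters: fp8e4nv is Triton's name for FP8 E4M3 (torch.float8_e4m3fn), which Triton only compiles for NVIDIA parts with compute capability 8.9 or newer (Ada/Hopper). This runner is a g5.4xlarge, whose A10G reports capability 8.6, so only fp8e4b15 and fp8e5 are available, exactly as the error message says. A sketch of a capability guard a test like this could use to skip rather than fail on older GPUs (the helper name is illustrative, not an existing FBGEMM utility):

import unittest
import torch

def gpu_supports_fp8_e4m3() -> bool:
    # Triton's fp8e4nv needs SM 8.9+ (e.g. L4, RTX 4090, H100); the A10G is SM 8.6.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

# Applied to the failing test, e.g.:
# @unittest.skipUnless(gpu_supports_fp8_e4m3(), "FP8 E4M3 requires SM 8.9+")
# def test_silu_mul_quant(...): ...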
Subsequent examples start to fail earlier, during input setup, with CUDA out-of-memory errors as GPU memory fills:

Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 40.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
E       torch.OutOfMemoryError at moe/activation_test.py:95 (x_clamp): tried to allocate 112.00 MiB with 32.44 MiB free (21.61 GiB allocated by PyTorch, 136.52 MiB reserved but unallocated)

Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
E       torch.OutOfMemoryError at moe/activation_test.py:92 (torch.randn): tried to allocate 448.00 MiB with 144.44 MiB free (21.50 GiB allocated by PyTorch, 136.52 MiB reserved but unallocated)

Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
E       torch.OutOfMemoryError at moe/activation_test.py:95 (x_clamp): tried to allocate 56.00 MiB with 32.44 MiB free (21.67 GiB allocated by PyTorch, 80.52 MiB reserved but unallocated)

Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
E       torch.OutOfMemoryError at moe/activation_test.py:94 (torch.sign): tried to allocate 56.00 MiB with 32.44 MiB free (21.67 GiB allocated by PyTorch, 80.52 MiB reserved but unallocated)
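Two mitigations are worth noting for these OOMs. The error text itself suggests PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, which helps when a large share of memory is "reserved but unallocated"; here, though, 21.5+ GiB is live PyTorch allocations, which points at tensors surviving across Hypothesis examples. A sketch of both, under that assumption (illustrative, not the repo's actual fix):

# 1) Launch the test process with expandable segments, per the error message:
#      PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python -m pytest moe/activation_test.py
# 2) Drop dead tensors and return cached blocks between examples:
import gc
import torch

def release_cuda_memory() -> None:
    gc.collect()              # collect Python garbage still holding CUDA tensors
    torch.cuda.empty_cache()  # hand cached, unused blocks back to the driver
    torch.cuda.synchronize()  # ensure pending frees complete before the next example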
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.3756601Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.3756834Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.3757186Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.3757278Z kernel = self.compile( 2025-05-07T20:32:52.3757679Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.3757852Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.3757981Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.3758067Z 2025-05-07T20:32:52.3758273Z self = 2025-05-07T20:32:52.3759080Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.3759744Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f121ba10f40>} 2025-05-07T20:32:52.3760535Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.3760730Z context = 2025-05-07T20:32:52.3760735Z 2025-05-07T20:32:52.3760906Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.3761179Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.3761284Z module_map=module_map) 2025-05-07T20:32:52.3761449Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.3761548Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.3761625Z E ^ 2025-05-07T20:32:52.3761991Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.3761996Z 2025-05-07T20:32:52.3762430Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.3762434Z 2025-05-07T20:32:52.3762539Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.3762765Z self=, 2025-05-07T20:32:52.3762842Z T=128, 2025-05-07T20:32:52.3762921Z D=5120, 2025-05-07T20:32:52.3763004Z scale_ub=None, 2025-05-07T20:32:52.3763089Z contiguous=True, 2025-05-07T20:32:52.3763172Z compiled=False, 2025-05-07T20:32:52.3763245Z ) 2025-05-07T20:32:52.3763463Z self = 2025-05-07T20:32:52.3763637Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:52.3763645Z 2025-05-07T20:32:52.3763724Z @given( 2025-05-07T20:32:52.3763840Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.3763939Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.3764051Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.3764165Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.3764293Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.3764377Z ) 2025-05-07T20:32:52.3764652Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.3764747Z def test_silu_mul_quant( 2025-05-07T20:32:52.3764823Z self, 2025-05-07T20:32:52.3764895Z T: int, 2025-05-07T20:32:52.3764972Z D: int, 2025-05-07T20:32:52.3765073Z scale_ub: Optional[float], 2025-05-07T20:32:52.3765170Z contiguous: bool, 2025-05-07T20:32:52.3765255Z compiled: bool, 2025-05-07T20:32:52.3765333Z ) -> None: 2025-05-07T20:32:52.3765430Z torch.manual_seed(2025) 2025-05-07T20:32:52.3765502Z 2025-05-07T20:32:52.3765671Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.3765748Z 2025-05-07T20:32:52.3765838Z x_sign = torch.sign(x) 2025-05-07T20:32:52.3765960Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.3766053Z x = x_sign * x_clamp 2025-05-07T20:32:52.3766132Z x0 = x[:, :D] 2025-05-07T20:32:52.3766210Z x1 = x[:, D:] 2025-05-07T20:32:52.3766287Z 2025-05-07T20:32:52.3766451Z if contiguous: 2025-05-07T20:32:52.3766546Z x0 = x0.contiguous() 2025-05-07T20:32:52.3766635Z x1 = x1.contiguous() 2025-05-07T20:32:52.3766706Z 2025-05-07T20:32:52.3766798Z if scale_ub is not None: 2025-05-07T20:32:52.3766904Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.3767111Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.3767191Z ) 2025-05-07T20:32:52.3767266Z else: 2025-05-07T20:32:52.3767356Z scale_ub_tensor = None 2025-05-07T20:32:52.3767430Z 2025-05-07T20:32:52.3767556Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.3767644Z op = silu_mul_quant 2025-05-07T20:32:52.3767730Z if compiled: 2025-05-07T20:32:52.3767828Z op = torch.compile(op) 2025-05-07T20:32:52.3767933Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.3768003Z 2025-05-07T20:32:52.3768093Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.3768103Z 2025-05-07T20:32:52.3768204Z moe/activation_test.py:117: 2025-05-07T20:32:52.3768333Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.3768433Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.3768541Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.3769061Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:52.3769156Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.3769530Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.3769759Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.3770115Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.3770208Z kernel = self.compile( 2025-05-07T20:32:52.3770609Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.3770788Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.3770914Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.3770927Z 2025-05-07T20:32:52.3771135Z self = 2025-05-07T20:32:52.3771937Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.3772451Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f121ba12020>} 2025-05-07T20:32:52.3773246Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.3773438Z context = 2025-05-07T20:32:52.3773446Z 2025-05-07T20:32:52.3773618Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.3773887Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.3773992Z module_map=module_map) 2025-05-07T20:32:52.3774155Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.3774252Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.3774333Z E ^ 2025-05-07T20:32:52.3774841Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.70 GiB is allocated by PyTorch, and 53.93 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
>       y_fp8, y_scale = fn()
E       triton.compiler.errors.CompilationError (fp8e4nv not supported, as above)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False)
>       x_sign = torch.sign(x)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB (30.44 MiB free)
moe/activation_test.py:94: OutOfMemoryError
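The CompilationError above is the expected Triton behavior on GPUs older than SM 8.9: the fp8e4nv encoding (torch.float8_e4m3fn) is only lowered on Ada/Hopper-class devices, and older parts (an A10G, for example, reports compute capability (8, 6)) only offer 'fp8e4b15' and 'fp8e5'. A minimal test-side guard, sketched under the assumption that pytest is the runner; the helper and marker names are illustrative, not from the test file:

    import pytest
    import torch

    def device_supports_fp8e4nv() -> bool:
        # Triton lowers fp8e4nv (float8_e4m3fn) only on SM 8.9+ (Ada/Hopper);
        # anything older raises the ValueError seen in this log.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    # Illustrative marker; the test body could equally call pytest.skip()
    # itself, since Hypothesis wraps the whole test function:
    requires_fp8e4nv = pytest.mark.skipif(
        not device_supports_fp8e4nv(),
        reason="fp8e4nv requires compute capability >= 8.9",
    )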
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB (30.44 MiB free)
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=False)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB (30.44 MiB free)
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=False)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB (30.44 MiB free)
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=True)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB (30.44 MiB free)
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB (30.44 MiB free)
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB (30.44 MiB free)
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=True)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB (30.44 MiB free)
moe/activation_test.py:92: OutOfMemoryError
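Note how little memory is free in every OOM above (30.44 MiB of 22.07 GiB): allocations from earlier Hypothesis examples are still alive when the next sampled shape runs, so even a 40 MiB request fails. A sketch of one mitigation, assuming it is invoked between examples (the helper name is illustrative, not part of the test):

    import gc
    import torch

    def release_cuda_memory() -> None:
        # Drop dangling Python references first, then return the caching
        # allocator's unused blocks to CUDA so the next sampled (T, D)
        # shape starts from a clean allocator.
        gc.collect()
        torch.cuda.empty_cache()

    # e.g. call release_cuda_memory() as the first statement of
    # test_silu_mul_quant, or from a pytest fixture that wraps each test.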
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=False)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB (30.44 MiB free)
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=False)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB (30.44 MiB free)
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB (30.44 MiB free)
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
>       y_fp8, y_scale = fn()
E       triton.compiler.errors.CompilationError (fp8e4nv not supported, as above)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB (30.44 MiB free)
moe/activation_test.py:92: OutOfMemoryError
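The OOM text itself names a mitigation: PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. The variable is read when the CUDA caching allocator initializes, so it has to be set before the first GPU allocation; a minimal sketch of the programmatic route (the environment-variable route, exporting it in the job before launching pytest, is equivalent):

    import os

    # Must be set before torch allocates on the GPU, ideally before importing
    # anything that touches CUDA. Shell equivalent:
    #   PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python -m pytest ./moe/activation_test.py
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch  # imported after the env var so the allocator picks it up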
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
>       y_fp8, y_scale = fn()
moe/activation_test.py:117:
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
E       triton.compiler.errors.CompilationError (fp8e4nv not supported, as above; the compiled=True path fails identically under torch.compile)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB (8.44 MiB free)
moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
>       x_sign = torch.sign(x)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB (8.44 MiB free)
moe/activation_test.py:94: OutOfMemoryError

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=True)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
moe/activation_test.py:92: OutOfMemoryError

=============================== warnings summary ===============================
../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108 (3 occurrences)
  /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details.
    warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. See "

experimental/gen_ai/test/moe/activation_test.py: 10 warnings
  /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py:72: FutureWarning: `torch.testing.assert_allclose()` is deprecated since 1.12 and will be removed in a future release. Please use `torch.testing.assert_close()` instead. You can find detailed upgrade instructions in https://github.com/pytorch/pytorch/issues/61844.
    torch.testing.assert_allclose(y, y_ref, rtol=1.6e-2, atol=1e-3)

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
================== 1 failed, 1 passed, 13 warnings in 20.57s ===================
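The FutureWarning above has a mechanical fix: torch.testing.assert_close accepts the same rtol/atol keywords, so the call at activation_test.py:72 can be swapped in place. A minimal sketch with illustrative tensors:

    import torch

    y = torch.tensor([1.000, 2.000])
    y_ref = torch.tensor([1.001, 2.001])

    # Deprecated since PyTorch 1.12:
    #   torch.testing.assert_allclose(y, y_ref, rtol=1.6e-2, atol=1e-3)
    # Replacement with identical tolerances:
    torch.testing.assert_close(y, y_ref, rtol=1.6e-2, atol=1e-3)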
2025-05-07T20:32:52.3919755Z 
2025-05-07T20:32:52.3919935Z experimental/gen_ai/test/moe/activation_test.py: 10 warnings
2025-05-07T20:32:52.3921376Z   /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py:72: FutureWarning: `torch.testing.assert_allclose()` is deprecated since 1.12 and will be removed in a future release. Please use `torch.testing.assert_close()` instead. You can find detailed upgrade instructions in https://github.com/pytorch/pytorch/issues/61844.
2025-05-07T20:32:52.3921570Z     torch.testing.assert_allclose(y, y_ref, rtol=1.6e-2, atol=1e-3)
2025-05-07T20:32:52.3921578Z 
2025-05-07T20:32:52.3921793Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
2025-05-07T20:32:52.3922031Z ================== 1 failed, 1 passed, 13 warnings in 20.57s ===================
2025-05-07T20:32:54.2490121Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./moe/activation_test.py` failed. (See above for error)
2025-05-07T20:32:54.3144448Z 
2025-05-07T20:32:54.3144861Z [TEST] Some tests FAILED. Re-attempting only FAILED tests: ./moe/activation_test.py
2025-05-07T20:32:54.3145232Z 
2025-05-07T20:32:54.3145236Z 
2025-05-07T20:32:54.3168165Z [EXEC] [ATTEMPT 0/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py
2025-05-07T20:32:56.4880275Z ============================= test session starts ==============================
2025-05-07T20:32:56.4880908Z platform linux -- Python 3.12.2, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python
2025-05-07T20:32:56.4881454Z cachedir: .pytest_cache
2025-05-07T20:32:56.4882038Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)
2025-05-07T20:32:56.4882783Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu
2025-05-07T20:32:56.4883202Z plugins: hypothesis-6.131.14
2025-05-07T20:32:58.1262337Z TMA benchmarks will be running with experimental grid constant TMA descriptor.
2025-05-07T20:32:58.2359117Z collecting ... collected 2 items / 1 deselected / 1 selected
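
[Editor's sketch] The FutureWarning in the summary above (activation_test.py:72) has an equally mechanical fix: torch.testing.assert_close replaces the deprecated assert_allclose, but it is stricter by default. A sketch of a drop-in helper, the name is ours, that keeps the test's tolerances while preserving the older, looser semantics:

# Sketch of the migration the FutureWarning asks for. assert_close also
# checks dtype and device by default, which assert_allclose did not, so
# disable those checks to keep the old behavior.
import torch


def assert_allclose_compat(
    actual: torch.Tensor,
    expected: torch.Tensor,
    rtol: float = 1.6e-2,
    atol: float = 1e-3,
) -> None:
    torch.testing.assert_close(
        actual,
        expected,
        rtol=rtol,
        atol=atol,
        check_dtype=False,
        check_device=False,
    )
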
2025-05-07T20:32:58.2359711Z run-last-failure: rerun previous 1 failure
2025-05-07T20:32:58.2360012Z 
2025-05-07T20:33:00.3994938Z W0507 20:33:00.397000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:33:00.3996087Z W0507 20:33:00.397000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Traceback (most recent call last):
2025-05-07T20:33:00.3997514Z W0507 20:33:00.397000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
2025-05-07T20:33:00.3999037Z W0507 20:33:00.397000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0]     ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
2025-05-07T20:33:00.4001655Z W0507 20:33:00.397000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0]                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:33:00.4003037Z W0507 20:33:00.397000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
2025-05-07T20:33:00.4004484Z W0507 20:33:00.397000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0]     ttir_module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:00.4005516Z W0507 20:33:00.397000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:33:00.4006809Z W0507 20:33:00.397000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir
2025-05-07T20:33:00.4008603Z W0507 20:33:00.397000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:00.4009724Z W0507 20:33:00.397000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:33:00.4011211Z W0507 20:33:00.397000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
2025-05-07T20:33:00.4012527Z W0507 20:33:00.397000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0]     generator.visit(fn.parse())
2025-05-07T20:33:00.4013813Z W0507 20:33:00.397000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit
2025-05-07T20:33:00.4015245Z W0507 20:33:00.397000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0]     ret = super().visit(node)
2025-05-07T20:33:00.4016116Z W0507 20:33:00.397000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0]           ^^^^^^^^^^^^^^^^^^^
2025-05-07T20:33:00.4017192Z W0507 
20:33:00.397000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:33:00.4018257Z W0507 20:33:00.397000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return visitor(node) 2025-05-07T20:33:00.4019086Z W0507 20:33:00.397000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^ 2025-05-07T20:33:00.4020369Z W0507 20:33:00.397000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:33:00.4021710Z W0507 20:33:00.397000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:33:00.4022885Z W0507 20:33:00.397000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:33:00.4023999Z W0507 20:33:00.397000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] self.visit(item) 2025-05-07T20:33:00.4031926Z W0507 20:33:00.397000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:33:00.4033397Z W0507 20:33:00.397000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:33:00.4034513Z W0507 20:33:00.397000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.4035483Z W0507 20:33:00.397000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.4036265Z W0507 20:33:00.397000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^ 2025-05-07T20:33:00.4037339Z W0507 20:33:00.397000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.4157868Z W0507 20:33:00.414000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:33:00.4159231Z W0507 20:33:00.414000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Traceback (most recent call last): 2025-05-07T20:33:00.4160646Z W0507 20:33:00.414000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:33:00.4162268Z W0507 20:33:00.414000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:33:00.4163292Z W0507 20:33:00.414000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:00.4164673Z W0507 20:33:00.414000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:33:00.4166125Z W0507 20:33:00.414000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.4167162Z W0507 20:33:00.414000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:00.4168460Z W0507 20:33:00.414000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:33:00.4169909Z W0507 20:33:00.414000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.4171033Z W0507 20:33:00.414000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:00.4172379Z W0507 20:33:00.414000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:33:00.4173705Z W0507 20:33:00.414000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] generator.visit(fn.parse()) 2025-05-07T20:33:00.4175133Z W0507 20:33:00.414000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:33:00.4176428Z W0507 20:33:00.414000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ret = super().visit(node) 2025-05-07T20:33:00.4177326Z W0507 20:33:00.414000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:00.4178392Z W0507 20:33:00.414000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:33:00.4179466Z W0507 20:33:00.414000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return visitor(node) 2025-05-07T20:33:00.4180299Z W0507 20:33:00.414000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^^^^^^^^^^^^^ 2025-05-07T20:33:00.4181574Z W0507 20:33:00.414000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:33:00.4183013Z W0507 20:33:00.414000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:33:00.4184195Z W0507 20:33:00.414000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:33:00.4185378Z W0507 20:33:00.414000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] self.visit(item) 2025-05-07T20:33:00.4186624Z W0507 20:33:00.414000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:33:00.4188062Z W0507 20:33:00.414000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:33:00.4189185Z W0507 20:33:00.414000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.4190138Z W0507 20:33:00.414000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.4190920Z W0507 20:33:00.414000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^ 2025-05-07T20:33:00.4191992Z W0507 20:33:00.414000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.8359927Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.8360646Z self=, 2025-05-07T20:33:00.8361076Z T=1, 2025-05-07T20:33:00.8361260Z D=5120, 2025-05-07T20:33:00.8361457Z scale_ub=None, 2025-05-07T20:33:00.8361711Z contiguous=True, 2025-05-07T20:33:00.8361930Z compiled=True, 2025-05-07T20:33:00.8362141Z ) 2025-05-07T20:33:00.8362469Z self = 2025-05-07T20:33:00.8362965Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:00.8363254Z 2025-05-07T20:33:00.8363335Z @given( 2025-05-07T20:33:00.8363569Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.8363888Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.8364208Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.8364553Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.8364893Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.8365187Z ) 2025-05-07T20:33:00.8365549Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.8366010Z def test_silu_mul_quant( 2025-05-07T20:33:00.8366257Z self, 2025-05-07T20:33:00.8366465Z T: int, 2025-05-07T20:33:00.8366668Z D: int, 2025-05-07T20:33:00.8366891Z scale_ub: Optional[float], 2025-05-07T20:33:00.8367178Z contiguous: bool, 2025-05-07T20:33:00.8367425Z compiled: bool, 2025-05-07T20:33:00.8367658Z ) -> None: 2025-05-07T20:33:00.8367884Z torch.manual_seed(2025) 2025-05-07T20:33:00.8368131Z 2025-05-07T20:33:00.8368408Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.8368780Z 2025-05-07T20:33:00.8368988Z x_sign = torch.sign(x) 2025-05-07T20:33:00.8369296Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.8369614Z x = x_sign * x_clamp 2025-05-07T20:33:00.8369863Z x0 = x[:, :D] 2025-05-07T20:33:00.8370089Z x1 = x[:, D:] 2025-05-07T20:33:00.8370295Z 2025-05-07T20:33:00.8370486Z if contiguous: 2025-05-07T20:33:00.8370725Z x0 = x0.contiguous() 2025-05-07T20:33:00.8371344Z x1 = x1.contiguous() 2025-05-07T20:33:00.8371599Z 2025-05-07T20:33:00.8371800Z if scale_ub is not None: 2025-05-07T20:33:00.8372076Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.8372423Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.8372890Z ) 2025-05-07T20:33:00.8373083Z else: 2025-05-07T20:33:00.8373299Z scale_ub_tensor = None 2025-05-07T20:33:00.8373566Z 2025-05-07T20:33:00.8373794Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.8374119Z op = silu_mul_quant 2025-05-07T20:33:00.8374483Z if compiled: 2025-05-07T20:33:00.8374736Z op = torch.compile(op) 2025-05-07T20:33:00.8375037Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.8375323Z 2025-05-07T20:33:00.8375517Z y_fp8, y_scale = fn() 2025-05-07T20:33:00.8375799Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:00.8376110Z 2025-05-07T20:33:00.8376390Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.8376726Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:00.8377027Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:00.8377352Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:00.8377719Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:00.8378039Z 2025-05-07T20:33:00.8378242Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:00.8378444Z 2025-05-07T20:33:00.8378555Z moe/activation_test.py:126: 2025-05-07T20:33:00.8378853Z _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:00.8379202Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:33:00.8379541Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:00.8380373Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:33:00.8381168Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:33:00.8381742Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:00.8382464Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:00.8383190Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:33:00.8383957Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:33:00.8384732Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:33:00.8385403Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:33:00.8386041Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:33:00.8386590Z     fn()
2025-05-07T20:33:00.8387131Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:33:00.8387749Z     self.fn.run(
2025-05-07T20:33:00.8388246Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:00.8388814Z     kernel = self.compile(
2025-05-07T20:33:00.8389378Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:00.8390071Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:00.8390489Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:00.8390726Z 
2025-05-07T20:33:00.8390944Z self = 
2025-05-07T20:33:00.8392166Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:00.8393629Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f99497ecc20>}
2025-05-07T20:33:00.8395131Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:33:00.8396223Z context = 
2025-05-07T20:33:00.8396525Z 
2025-05-07T20:33:00.8396706Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:00.8397252Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:00.8397746Z                            module_map=module_map)
2025-05-07T20:33:00.8398135Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:00.8398505Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:33:00.8398796Z E       ^
2025-05-07T20:33:00.8399287Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:00.8399771Z 
2025-05-07T20:33:00.8400220Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:00.8400766Z 
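
[Editor's sketch] Every CompilationError in this rerun has the same root cause: fp8e4nv is Triton's name for the float8_e4m3fn format, which requires an SM 8.9+ GPU (Ada or Hopper). The A10G on this g5 runner reports compute capability 8.6, where only fp8e5 and fp8e4b15 exist, so the kernel is rejected at compile time before it ever runs. A sketch of a capability gate that would skip, rather than fail, these examples on pre-8.9 parts; the marker name is illustrative, not the suite's:

# Sketch: gate FP8 (e4m3) tests on compute capability instead of letting
# Triton fail during AST-to-TTIR compilation.
import pytest
import torch


def _supports_fp8e4nv() -> bool:
    if not torch.cuda.is_available():
        return False
    major, minor = torch.cuda.get_device_capability()
    # fp8e4nv (float8_e4m3fn) needs SM 8.9+ (Ada / Hopper); the A10G on
    # g5.4xlarge runners reports (8, 6).
    return (major, minor) >= (8, 9)


requires_fp8 = pytest.mark.skipif(
    not _supports_fp8e4nv(),
    reason="fp8e4nv (float8_e4m3fn) requires compute capability >= 8.9",
)
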
2025-05-07T20:33:00.8400874Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:00.8401310Z     self=,
2025-05-07T20:33:00.8401738Z     T=2048,
2025-05-07T20:33:00.8401934Z     D=5120,
2025-05-07T20:33:00.8402130Z     scale_ub=1200.0,
2025-05-07T20:33:00.8402355Z     contiguous=True,
2025-05-07T20:33:00.8402576Z     compiled=False,
2025-05-07T20:33:00.8402788Z )
2025-05-07T20:33:01.2872141Z W0507 20:33:01.283000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:33:01.2873323Z W0507 20:33:01.283000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Traceback (most recent call last):
2025-05-07T20:33:01.2874768Z W0507 20:33:01.283000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
2025-05-07T20:33:01.2876307Z W0507 20:33:01.283000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1]     ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
2025-05-07T20:33:01.2877347Z W0507 20:33:01.283000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1]                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:33:01.2878728Z W0507 20:33:01.283000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
2025-05-07T20:33:01.2880216Z W0507 20:33:01.283000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1]     ttir_module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:01.2881258Z W0507 20:33:01.283000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:33:01.2882561Z W0507 20:33:01.283000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir
2025-05-07T20:33:01.2884415Z W0507 20:33:01.283000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:01.2885550Z W0507 20:33:01.283000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:33:01.2887119Z W0507 20:33:01.283000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
2025-05-07T20:33:01.2888456Z W0507 20:33:01.283000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1]     generator.visit(fn.parse())
2025-05-07T20:33:01.2889751Z W0507 20:33:01.283000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1]   File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:33:01.2891030Z W0507 20:33:01.283000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ret = super().visit(node) 2025-05-07T20:33:01.2891892Z W0507 20:33:01.283000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:01.2892985Z W0507 20:33:01.283000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:33:01.2894064Z W0507 20:33:01.283000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] return visitor(node) 2025-05-07T20:33:01.2895019Z W0507 20:33:01.283000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^ 2025-05-07T20:33:01.2896308Z W0507 20:33:01.283000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:33:01.2897657Z W0507 20:33:01.283000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:33:01.2898844Z W0507 20:33:01.283000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:33:01.2899952Z W0507 20:33:01.283000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] self.visit(item) 2025-05-07T20:33:01.2901203Z W0507 20:33:01.283000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:33:01.2902645Z W0507 20:33:01.283000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:33:01.2903760Z W0507 20:33:01.283000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.2904718Z W0507 20:33:01.283000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.2905502Z W0507 20:33:01.283000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^ 2025-05-07T20:33:01.2906577Z W0507 20:33:01.283000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.3776168Z W0507 20:33:01.374000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:33:01.3777517Z W0507 20:33:01.374000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Traceback (most recent call last): 2025-05-07T20:33:01.3778927Z W0507 20:33:01.374000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:33:01.3780551Z W0507 20:33:01.374000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:33:01.3781571Z W0507 20:33:01.374000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:01.3782946Z W0507 20:33:01.374000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:33:01.3784394Z W0507 20:33:01.374000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.3785443Z W0507 20:33:01.374000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:01.3786733Z W0507 20:33:01.374000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:33:01.3788183Z W0507 20:33:01.374000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.3789298Z W0507 20:33:01.374000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:01.3790657Z W0507 20:33:01.374000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:33:01.3791985Z W0507 20:33:01.374000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] generator.visit(fn.parse()) 2025-05-07T20:33:01.3793277Z W0507 20:33:01.374000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:33:01.3794550Z W0507 20:33:01.374000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ret = super().visit(node) 2025-05-07T20:33:01.3795408Z W0507 20:33:01.374000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:01.3796484Z W0507 20:33:01.374000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:33:01.3797559Z W0507 20:33:01.374000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] return visitor(node) 2025-05-07T20:33:01.3798391Z W0507 20:33:01.374000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^^^^^^^^^^^^^ 2025-05-07T20:33:01.3799657Z W0507 20:33:01.374000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:33:01.3801090Z W0507 20:33:01.374000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:33:01.3802262Z W0507 20:33:01.374000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:33:01.3803434Z W0507 20:33:01.374000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] self.visit(item) 2025-05-07T20:33:01.3804675Z W0507 20:33:01.374000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:33:01.3806093Z W0507 20:33:01.374000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:33:01.3807205Z W0507 20:33:01.374000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.3808154Z W0507 20:33:01.374000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.3808929Z W0507 20:33:01.374000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^ 2025-05-07T20:33:01.3809989Z W0507 20:33:01.374000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.8416858Z self = 2025-05-07T20:33:01.8417503Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:01.8417812Z 2025-05-07T20:33:01.8417901Z @given( 2025-05-07T20:33:01.8418142Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.8418482Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.8418796Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.8419135Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.8419463Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.8419770Z ) 2025-05-07T20:33:01.8420124Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.8420579Z def test_silu_mul_quant( 2025-05-07T20:33:01.8420818Z self, 2025-05-07T20:33:01.8421015Z T: int, 2025-05-07T20:33:01.8421215Z D: int, 2025-05-07T20:33:01.8421432Z scale_ub: Optional[float], 2025-05-07T20:33:01.8421707Z contiguous: bool, 2025-05-07T20:33:01.8421947Z compiled: bool, 2025-05-07T20:33:01.8422167Z ) -> None: 2025-05-07T20:33:01.8422382Z torch.manual_seed(2025) 2025-05-07T20:33:01.8422627Z 2025-05-07T20:33:01.8422902Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.8423255Z 2025-05-07T20:33:01.8423449Z x_sign = torch.sign(x) 2025-05-07T20:33:01.8423735Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.8424050Z x = x_sign * x_clamp 2025-05-07T20:33:01.8424299Z x0 = x[:, :D] 2025-05-07T20:33:01.8424508Z x1 = x[:, D:] 2025-05-07T20:33:01.8424716Z 2025-05-07T20:33:01.8424910Z if contiguous: 2025-05-07T20:33:01.8425135Z x0 = x0.contiguous() 2025-05-07T20:33:01.8425567Z x1 = x1.contiguous() 2025-05-07T20:33:01.8425827Z 2025-05-07T20:33:01.8426026Z if scale_ub is not None: 2025-05-07T20:33:01.8426304Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.8426693Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.8427032Z ) 2025-05-07T20:33:01.8427229Z else: 2025-05-07T20:33:01.8427447Z scale_ub_tensor = None 2025-05-07T20:33:01.8427701Z 2025-05-07T20:33:01.8428156Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.8428487Z op = silu_mul_quant 2025-05-07T20:33:01.8428739Z if compiled: 2025-05-07T20:33:01.8428984Z op = torch.compile(op) 2025-05-07T20:33:01.8429445Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.8429736Z 2025-05-07T20:33:01.8429928Z > y_fp8, y_scale = fn() 2025-05-07T20:33:01.8430101Z 2025-05-07T20:33:01.8430206Z moe/activation_test.py:117: 2025-05-07T20:33:01.8430513Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.8430859Z moe/activation_test.py:115: in fn 2025-05-07T20:33:01.8431147Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.8431877Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:01.8432606Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:01.8433177Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.8433898Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.8434629Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.8435194Z kernel = self.compile( 2025-05-07T20:33:01.8435763Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.8436458Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.8436866Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.8437110Z 2025-05-07T20:33:01.8437324Z self = 2025-05-07T20:33:01.8438455Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.8439898Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f99498a8180>} 2025-05-07T20:33:01.8441305Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.8442390Z context = 2025-05-07T20:33:01.8442698Z 2025-05-07T20:33:01.8442870Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.8443416Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.8443904Z module_map=module_map) 2025-05-07T20:33:01.8444283Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.8444650Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.8444922Z E ^ 2025-05-07T20:33:01.8445399Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.8445887Z 2025-05-07T20:33:01.8446327Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.8446871Z 2025-05-07T20:33:01.8446983Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.8447408Z self=, 2025-05-07T20:33:01.8447832Z T=2048, 2025-05-07T20:33:01.8448028Z D=5120, 2025-05-07T20:33:01.8448235Z scale_ub=1200.0, 2025-05-07T20:33:01.8448460Z contiguous=True, 2025-05-07T20:33:01.8448691Z compiled=True, 2025-05-07T20:33:01.8448904Z ) 2025-05-07T20:33:01.8449315Z self = 2025-05-07T20:33:01.8449833Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:01.8450118Z 2025-05-07T20:33:01.8450207Z @given( 2025-05-07T20:33:01.8450513Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.8450841Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.8451160Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.8451495Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.8451840Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.8452144Z ) 2025-05-07T20:33:01.8452509Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.8452967Z def test_silu_mul_quant( 2025-05-07T20:33:01.8453214Z self, 2025-05-07T20:33:01.8453413Z T: int, 2025-05-07T20:33:01.8453610Z D: int, 2025-05-07T20:33:01.8453843Z scale_ub: Optional[float], 2025-05-07T20:33:01.8454123Z contiguous: bool, 2025-05-07T20:33:01.8454470Z compiled: bool, 2025-05-07T20:33:01.8454697Z ) -> None: 2025-05-07T20:33:01.8454911Z torch.manual_seed(2025) 2025-05-07T20:33:01.8455149Z 2025-05-07T20:33:01.8455424Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.8455777Z 2025-05-07T20:33:01.8455963Z x_sign = torch.sign(x) 2025-05-07T20:33:01.8456253Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.8456566Z x = x_sign * x_clamp 2025-05-07T20:33:01.8456800Z x0 = x[:, :D] 
2025-05-07T20:33:01.8457013Z         x1 = x[:, D:]
2025-05-07T20:33:01.8457222Z 
2025-05-07T20:33:01.8457405Z         if contiguous:
2025-05-07T20:33:01.8457632Z             x0 = x0.contiguous()
2025-05-07T20:33:01.8457891Z             x1 = x1.contiguous()
2025-05-07T20:33:01.8458135Z 
2025-05-07T20:33:01.8458321Z         if scale_ub is not None:
2025-05-07T20:33:01.8458601Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:33:01.8458940Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:33:01.8459245Z             )
2025-05-07T20:33:01.8459450Z         else:
2025-05-07T20:33:01.8459676Z             scale_ub_tensor = None
2025-05-07T20:33:01.8459933Z 
2025-05-07T20:33:01.8460167Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:01.8460485Z             op = silu_mul_quant
2025-05-07T20:33:01.8460732Z             if compiled:
2025-05-07T20:33:01.8460980Z                 op = torch.compile(op)
2025-05-07T20:33:01.8461282Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:01.8461561Z 
2025-05-07T20:33:01.8461755Z         y_fp8, y_scale = fn()
2025-05-07T20:33:01.8462049Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:33:01.8462348Z 
2025-05-07T20:33:01.8462577Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:01.8462925Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:33:01.8463224Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:33:01.8463538Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:33:01.8463907Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:01.8464227Z 
2025-05-07T20:33:01.8464426Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:33:01.8464633Z 
2025-05-07T20:33:01.8464729Z moe/activation_test.py:126: 
2025-05-07T20:33:01.8465033Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:01.8465383Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:33:01.8465711Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:01.8466536Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:33:01.8467326Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:33:01.8467974Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:01.8468695Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:01.8469420Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:33:01.8470255Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:33:01.8471020Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:33:01.8471698Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:33:01.8472332Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:33:01.8472879Z     fn()
2025-05-07T20:33:01.8473413Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:33:01.8474030Z     self.fn.run(
2025-05-07T20:33:01.8474516Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:01.8475066Z     kernel = self.compile(
2025-05-07T20:33:01.8475636Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:01.8476320Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:01.8476732Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:01.8476992Z 
2025-05-07T20:33:01.8477225Z self = 
2025-05-07T20:33:01.8478350Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:01.8479770Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9948439580>}
2025-05-07T20:33:01.8481169Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:33:01.8488681Z context = 
2025-05-07T20:33:01.8489038Z 
2025-05-07T20:33:01.8489224Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:01.8489843Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:01.8490330Z                            module_map=module_map)
2025-05-07T20:33:01.8490703Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:01.8491082Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:33:01.8491362Z E       ^
2025-05-07T20:33:01.8491835Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:01.8492316Z 
2025-05-07T20:33:01.8492765Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:01.8493318Z 
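
[Editor's sketch] The test's dequantization above, y = y_fp8.to(torch.float32) * y_scale[:, None], fixes the contract: one scale per row. A pure-PyTorch sketch of what we take triton_quantize_fp8_row to compute under that contract (a simplified assumption on our part; fbgemm's kernel also applies an epsilon floor and other details we omit), usable as a reference on GPUs where the Triton FP8 kernels cannot compile:

# Requires a PyTorch build with torch.float8_e4m3fn (2.1+). The cast itself
# is an elementwise conversion; only fused FP8 kernels need SM 8.9+.
from typing import Optional, Tuple

import torch

FP8_MAX = 448.0  # torch.finfo(torch.float8_e4m3fn).max


def quantize_fp8_row_ref(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Per-row absolute maximum, optionally clamped to the scale upper bound.
    row_max = y.abs().amax(dim=-1).float()
    if scale_ub is not None:
        row_max = torch.clamp(row_max, max=scale_ub.item())
    row_max = torch.clamp(row_max, min=1e-12)  # avoid divide-by-zero rows
    inv_scale = FP8_MAX / row_max  # multiply by this to quantize
    y_fp8 = (y.float() * inv_scale[:, None]).clamp(-FP8_MAX, FP8_MAX)
    # y_scale = row_max / FP8_MAX, so y_fp8.float() * y_scale[:, None] ~ y.
    return y_fp8.to(torch.float8_e4m3fn), row_max / FP8_MAX
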
2025-05-07T20:33:01.8493423Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:01.8493855Z     self=,
2025-05-07T20:33:01.8494271Z     T=16384,
2025-05-07T20:33:01.8494534Z     D=7168,
2025-05-07T20:33:01.8494735Z     scale_ub=1200.0,
2025-05-07T20:33:01.8494957Z     contiguous=False,
2025-05-07T20:33:01.8495189Z     compiled=False,
2025-05-07T20:33:01.8495403Z )
2025-05-07T20:33:02.0962912Z W0507 20:33:02.093000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:33:02.0965148Z W0507 20:33:02.093000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Traceback (most recent call last):
2025-05-07T20:33:02.0967430Z W0507 20:33:02.093000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
2025-05-07T20:33:02.0969056Z W0507 20:33:02.093000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]     ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
2025-05-07T20:33:02.0970090Z W0507 20:33:02.093000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:33:02.0971478Z W0507 20:33:02.093000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
2025-05-07T20:33:02.0972935Z W0507 20:33:02.093000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]     ttir_module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:02.0973978Z W0507 20:33:02.093000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] 
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:02.0975359Z W0507 20:33:02.093000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:33:02.0976827Z W0507 20:33:02.093000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:02.0977952Z W0507 20:33:02.093000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:02.0979310Z W0507 20:33:02.093000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:33:02.0980636Z W0507 20:33:02.093000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] generator.visit(fn.parse()) 2025-05-07T20:33:02.0981930Z W0507 20:33:02.093000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:33:02.0983220Z W0507 20:33:02.093000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ret = super().visit(node) 2025-05-07T20:33:02.0984093Z W0507 20:33:02.093000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:02.0985173Z W0507 20:33:02.093000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:33:02.0986262Z W0507 20:33:02.093000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] return visitor(node) 2025-05-07T20:33:02.0987092Z W0507 20:33:02.093000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^^^^^^^^^^^^^ 2025-05-07T20:33:02.0988371Z W0507 20:33:02.093000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:33:02.0989797Z W0507 20:33:02.093000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:33:02.0990982Z W0507 20:33:02.093000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:33:02.0992165Z W0507 20:33:02.093000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] self.visit(item) 2025-05-07T20:33:02.0993414Z W0507 20:33:02.093000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:33:02.0994855Z W0507 20:33:02.093000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:33:02.0995969Z 
W0507 20:33:02.093000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:02.0996979Z W0507 20:33:02.093000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] def _fbgemm_silu_mul_quant(
2025-05-07T20:33:02.0997767Z W0507 20:33:02.093000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^
2025-05-07T20:33:02.0998848Z W0507 20:33:02.093000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:02.1597087Z W0507 20:33:02.156000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:33:02.1598207Z W0507 20:33:02.156000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Traceback (most recent call last):
2025-05-07T20:33:02.1599604Z W0507 20:33:02.156000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
2025-05-07T20:33:02.1601106Z W0507 20:33:02.156000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]     ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
2025-05-07T20:33:02.1602132Z W0507 20:33:02.156000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:33:02.1603520Z W0507 20:33:02.156000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
2025-05-07T20:33:02.1604983Z W0507 20:33:02.156000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]     ttir_module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:02.1606014Z W0507 20:33:02.156000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:33:02.1607365Z W0507 20:33:02.156000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir
2025-05-07T20:33:02.1608826Z W0507 20:33:02.156000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:02.1610113Z W0507 20:33:02.156000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-05-07T20:33:02.1611472Z W0507 20:33:02.156000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
2025-05-07T20:33:02.1612891Z W0507 20:33:02.156000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]     generator.visit(fn.parse())
2025-05-07T20:33:02.1614182Z W0507 20:33:02.156000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit
2025-05-07T20:33:02.1615554Z W0507 20:33:02.156000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]     ret = super().visit(node)
2025-05-07T20:33:02.1617512Z W0507 20:33:02.156000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit
2025-05-07T20:33:02.1618584Z W0507 20:33:02.156000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]     return visitor(node)
2025-05-07T20:33:02.1620687Z W0507 20:33:02.156000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
2025-05-07T20:33:02.1622048Z W0507 20:33:02.156000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]     ast.NodeVisitor.generic_visit(self, node)
2025-05-07T20:33:02.1623217Z W0507 20:33:02.156000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit
2025-05-07T20:33:02.1624319Z W0507 20:33:02.156000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]     self.visit(item)
2025-05-07T20:33:02.1625787Z W0507 20:33:02.156000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit
2025-05-07T20:33:02.1627224Z W0507 20:33:02.156000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2]     raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
2025-05-07T20:33:02.1628344Z W0507 20:33:02.156000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:02.1629295Z W0507 20:33:02.156000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] def _fbgemm_silu_mul_quant(
2025-05-07T20:33:02.1630078Z W0507 20:33:02.156000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^
2025-05-07T20:33:02.1631159Z W0507 20:33:02.156000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:02.6724677Z self = 
T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = 
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9948439c60>}
module_map = {'triton.language.extra.libdevice': }
context = 

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
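Every failure in this job reduces to the same root cause: Triton's fp8e4nv is the NVIDIA FP8 E4M3 format, whose conversions are only available on GPUs with compute capability 8.9 or newer (Ada, Hopper). This job runs on a g5 instance, i.e. an A10G at sm_86, so any kernel that casts to fp8e4nv fails at compile time. A minimal sketch of a capability gate follows; the helper and marker names are illustrative, not FBGEMM's actual test utilities:

    # Sketch only: skip FP8 tests on GPUs older than sm_89, where Triton's
    # fp8e4nv (FP8 E4M3) type is unavailable. Names here are hypothetical.
    import pytest
    import torch

    def supports_fp8e4nv() -> bool:
        if not torch.cuda.is_available():
            return False
        major, minor = torch.cuda.get_device_capability()
        return (major, minor) >= (8, 9)  # Ada (sm_89) and Hopper (sm_90+)

    requires_fp8 = pytest.mark.skipif(
        not supports_fp8e4nv(),
        reason="Triton fp8e4nv requires compute capability >= 8.9",
    )

Decorating test_silu_mul_quant with such a marker would turn these hard failures into skips on pre-Ada runners like this one.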
2025-05-07T20:33:02.6755769Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=None, contiguous=True, compiled=True,)

self = 
T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True

    ...
        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = 
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f994843ad40>}
module_map = {'triton.language.extra.libdevice': }
context = 

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
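Both failing paths quantize row-wise to FP8: the fused _fbgemm_silu_mul_quant kernel writes FP8 directly, and ref_fn quantizes silu(x0) * x1 through triton_quantize_fp8_row. A minimal eager sketch of that row-wise scheme, assuming the semantics implied by the test's dequantization check y ~= y_fp8.float() * y_scale[:, None]; the scale_ub handling below is my reading, not FBGEMM's implementation:

    # Eager sketch of row-wise FP8 quantization; not FBGEMM's code.
    from typing import Optional, Tuple
    import torch

    FP8_MAX = 448.0  # largest finite value of torch.float8_e4m3fn

    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        row_max = y.abs().amax(dim=1).float()
        if scale_ub is not None:
            # Assumption: scale_ub caps the per-row max before scaling.
            row_max = torch.clamp(row_max, max=scale_ub.item())
        row_max = torch.clamp(row_max, min=1e-12)  # avoid divide-by-zero
        y_scale = row_max / FP8_MAX                # per-row dequant scale
        y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, y_scale

With this convention, y_fp8.to(torch.float32) * y_scale[:, None] recovers y up to FP8 rounding, which is exactly what the test compares.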
2025-05-07T20:33:02.6797585Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False,)
2025-05-07T20:33:03.1575278Z W0507 20:33:03.154000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:33:03.3769908Z W0507 20:33:03.374000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
self = 
T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False

    ...
        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117: 
...
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
    ...
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:33:03.9821726Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False,)

self = 
T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False

    ...
>       y_fp8, y_scale = fn()

moe/activation_test.py:117: 
...
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:33:03.9854247Z Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=False, compiled=True,)

self = 
T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True

    ...
        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126: 
...
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
    ...
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
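The ValueError surfaces while Triton lowers the kernel's Python AST to TTIR, as soon as the code generator meets a conversion to tl.float8e4nv. A standalone repro sketch, independent of FBGEMM; the kernel and variable names are illustrative only:

    # Minimal repro sketch: on sm_86 this launch fails at compile time with
    # the same CompilationError/ValueError seen above; on sm_89+ it runs.
    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _cast_fp8e4nv(X, Y, n, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(X + offs, mask=mask)
        # The cast below is what triggers "type fp8e4nv not supported".
        tl.store(Y + offs, x.to(tl.float8e4nv), mask=mask)

    x = torch.randn(128, device="cuda")
    y = torch.empty_like(x, dtype=torch.float8_e4m3fn)
    _cast_fp8e4nv[(1,)](x, y, x.numel(), BLOCK=128)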
2025-05-07T20:33:04.0461526Z Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=False, compiled=False,)

self = 
T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False

    ...
>       y_fp8, y_scale = fn()

moe/activation_test.py:117: 
...
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
    ...
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:33:04.2482011Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False,)

self = 
T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False

    ...
>       y_fp8, y_scale = fn()

moe/activation_test.py:117: 
...
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
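Hypothesis prints each "Trying example" line because the test runs at Verbosity.verbose, and every drawn parameter set fails identically. Once a failing set is triaged, it can be pinned with hypothesis's @example decorator so future runs always exercise it. A sketch, with an assumed _MAX_SAMPLES value and the test body elided:

    # Sketch: pin one failing parameter set from this log with @example.
    # _MAX_SAMPLES is assumed; the real constant lives in the test module.
    from hypothesis import Verbosity, example, given, settings
    from hypothesis import strategies as st

    _MAX_SAMPLES = 100

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @example(T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant_pinned(T, D, scale_ub, contiguous, compiled) -> None:
        pass  # body as in the original test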
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:33:04.4903219Z W0507 20:33:04.485000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ret = super().visit(node) 2025-05-07T20:33:04.4904097Z W0507 20:33:04.485000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:04.4905177Z W0507 20:33:04.485000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 407, in visit 2025-05-07T20:33:04.4906251Z W0507 20:33:04.485000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] return visitor(node) 2025-05-07T20:33:04.4907095Z W0507 20:33:04.485000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ^^^^^^^^^^^^^ 2025-05-07T20:33:04.4908381Z W0507 20:33:04.485000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:33:04.4909739Z W0507 20:33:04.485000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:33:04.4910930Z W0507 20:33:04.485000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/ast.py", line 415, in generic_visit 2025-05-07T20:33:04.4912028Z W0507 20:33:04.485000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] self.visit(item) 2025-05-07T20:33:04.4913287Z W0507 20:33:04.485000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:33:04.4914729Z W0507 20:33:04.485000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:33:04.4915847Z W0507 20:33:04.485000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:04.4916890Z W0507 20:33:04.485000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] def _fbgemm_silu_mul_quant( 2025-05-07T20:33:04.4917666Z W0507 20:33:04.485000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ^ 2025-05-07T20:33:04.4918871Z W0507 20:33:04.485000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:04.8574454Z self = 2025-05-07T20:33:04.8574961Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:04.8575266Z 2025-05-07T20:33:04.8575349Z @given( 2025-05-07T20:33:04.8575594Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:04.8575917Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:04.8576230Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:04.8576569Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:04.8576907Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:04.8577201Z ) 2025-05-07T20:33:04.8577559Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:04.8578019Z def test_silu_mul_quant( 2025-05-07T20:33:04.8578260Z self, 2025-05-07T20:33:04.8578457Z T: int, 2025-05-07T20:33:04.8578650Z D: int, 2025-05-07T20:33:04.8578869Z scale_ub: Optional[float], 2025-05-07T20:33:04.8579138Z contiguous: bool, 2025-05-07T20:33:04.8579379Z compiled: bool, 2025-05-07T20:33:04.8579606Z ) -> None: 2025-05-07T20:33:04.8579820Z torch.manual_seed(2025) 2025-05-07T20:33:04.8580067Z 2025-05-07T20:33:04.8580349Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:04.8580704Z 2025-05-07T20:33:04.8580906Z x_sign = torch.sign(x) 2025-05-07T20:33:04.8581201Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:04.8581517Z x = x_sign * x_clamp 2025-05-07T20:33:04.8581915Z x0 = x[:, :D] 2025-05-07T20:33:04.8582142Z x1 = x[:, D:] 2025-05-07T20:33:04.8582347Z 2025-05-07T20:33:04.8582543Z if contiguous: 2025-05-07T20:33:04.8582778Z x0 = x0.contiguous() 2025-05-07T20:33:04.8583035Z x1 = x1.contiguous() 2025-05-07T20:33:04.8583396Z 2025-05-07T20:33:04.8583595Z if scale_ub is not None: 2025-05-07T20:33:04.8583878Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:04.8584223Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:04.8584545Z ) 2025-05-07T20:33:04.8584745Z else: 2025-05-07T20:33:04.8584956Z scale_ub_tensor = None 2025-05-07T20:33:04.8585216Z 2025-05-07T20:33:04.8585453Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:04.8585769Z op = silu_mul_quant 2025-05-07T20:33:04.8586021Z if compiled: 2025-05-07T20:33:04.8586273Z op = torch.compile(op) 2025-05-07T20:33:04.8586580Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.8586862Z 2025-05-07T20:33:04.8587056Z y_fp8, y_scale = fn() 2025-05-07T20:33:04.8587343Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:04.8587646Z 2025-05-07T20:33:04.8587889Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:04.8588228Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:04.8588536Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:04.8588861Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:04.8589234Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:04.8589547Z 2025-05-07T20:33:04.8589748Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:04.8589949Z 2025-05-07T20:33:04.8590050Z moe/activation_test.py:126: 2025-05-07T20:33:04.8590349Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.8590698Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:04.8591037Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:04.8591860Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 
2025-05-07T20:33:04.8592652Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:04.8593224Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda> 2025-05-07T20:33:04.8593945Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:04.8594662Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:04.8595434Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:04.8596210Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:04.8596892Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:04.8597528Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:04.8598087Z fn() 2025-05-07T20:33:04.8598635Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:04.8599262Z self.fn.run( 2025-05-07T20:33:04.8599753Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:04.8600316Z kernel = self.compile( 2025-05-07T20:33:04.8600887Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:04.8601578Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:04.8601993Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.8602230Z 2025-05-07T20:33:04.8602540Z self = 2025-05-07T20:33:04.8603675Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:04.8605172Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f99484c7d80>} 2025-05-07T20:33:04.8606581Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:04.8607674Z context = 2025-05-07T20:33:04.8607976Z 2025-05-07T20:33:04.8608164Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:04.8608708Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:04.8609199Z module_map=module_map) 2025-05-07T20:33:04.8609575Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:04.8609959Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:04.8610235Z E ^ 2025-05-07T20:33:04.8610716Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:04.8611192Z 2025-05-07T20:33:04.8611637Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:04.8612181Z
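Every failure in this job reduces to the same root cause: this Triton build lowers fp8e4nv (the dtype behind torch.float8_e4m3fn) only on SM 8.9+ GPUs (Ada/Hopper), while the A10G on a linux.g5.4xlarge runner reports compute capability 8.6, where only fp8e4b15 and fp8e5 are available. A minimal sketch of a capability guard that would skip these examples on such runners; the helper and class names below are illustrative, not the suite's actual code:

import unittest

import torch


def cuda_supports_fp8e4nv() -> bool:
    # Assumption: fp8e4nv (e4m3) Triton kernels need an SM 8.9+ GPU in
    # this build; the A10G driving this job reports capability (8, 6).
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)


class ActivationTests(unittest.TestCase):  # hypothetical class name
    @unittest.skipUnless(cuda_supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")
    def test_silu_mul_quant(self) -> None:
        ...

With a guard like this, hypothesis would report the examples below as skipped instead of replaying the same CompilationError for each one.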
2025-05-07T20:33:04.8612296Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:04.8612725Z self=, 2025-05-07T20:33:04.8613150Z T=2048, 2025-05-07T20:33:04.8613359Z D=5120, 2025-05-07T20:33:04.8613555Z scale_ub=None, 2025-05-07T20:33:04.8613778Z contiguous=True, 2025-05-07T20:33:04.8614002Z compiled=True, 2025-05-07T20:33:04.8614204Z ) 2025-05-07T20:33:05.0870189Z W0507 20:33:05.084000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:33:05.0901372Z W0507 20:33:05.084000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.0902319Z W0507 20:33:05.084000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.0903104Z W0507 20:33:05.084000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ^ 2025-05-07T20:33:05.0904295Z W0507 20:33:05.084000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.4548282Z self = 2025-05-07T20:33:05.4548850Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:05.4563358Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:05.4563663Z moe/activation_test.py:126: 2025-05-07T20:33:05.4563965Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.4564311Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:05.4564636Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:05.4565457Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:05.4566243Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:05.4583149Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.4583520Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:05.4583797Z E ^ 2025-05-07T20:33:05.4584281Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.4584761Z 2025-05-07T20:33:05.4585198Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.4585749Z
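For orientation while reading these tracebacks: the test dequantizes with y_fp8.to(torch.float32) * y_scale[:, None], so triton_quantize_fp8_row evidently returns a rowwise fp8 tensor plus a per-row dequantization scale. A pure-PyTorch sketch of that contract, inferred only from the usage above; the function name, epsilon, and clamping details are assumptions, not FBGEMM's implementation:

from typing import Optional, Tuple

import torch


def quantize_fp8_row_ref(
    x: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # The per-row absolute maximum decides how far each row must be
    # scaled to fit the fp8 e4m3 range; scale_ub optionally caps it.
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    row_amax = x.abs().amax(dim=1).float()
    if scale_ub is not None:
        row_amax = torch.minimum(row_amax, scale_ub)
    y_scale = row_amax.clamp(min=1e-12) / fp8_max  # dequantization scale
    y_fp8 = (x.float() / y_scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, y_scale

A stand-in like this only needs fp8 storage at the PyTorch level, which works on SM 8.6; it is the Triton kernel compilation that fails in this log.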
2025-05-07T20:33:05.4585854Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.4586287Z self=, 2025-05-07T20:33:05.4586701Z T=128, 2025-05-07T20:33:05.4586900Z D=5120, 2025-05-07T20:33:05.4587086Z scale_ub=None, 2025-05-07T20:33:05.4587297Z contiguous=True, 2025-05-07T20:33:05.4587555Z compiled=True, 2025-05-07T20:33:05.4587767Z ) 2025-05-07T20:33:05.7014920Z W0507 20:33:05.698000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:33:05.7045995Z W0507 20:33:05.698000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.7046949Z W0507 20:33:05.698000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.7047728Z W0507 20:33:05.698000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ^ 2025-05-07T20:33:05.7048842Z W0507 20:33:05.698000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:06.1169375Z self = 2025-05-07T20:33:06.1169903Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:06.1184930Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:06.1185239Z moe/activation_test.py:126: 2025-05-07T20:33:06.1185555Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:06.1185899Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:06.1186235Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:06.1187058Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:06.1187852Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:06.1204696Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:06.1205069Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:06.1205348Z E ^ 2025-05-07T20:33:06.1205838Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:06.1206319Z 2025-05-07T20:33:06.1206758Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:06.1207302Z
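The architecture gap can be confirmed independently of the test suite; a minimal sketch that should hit the same compile-time ValueError on an SM 8.6 part (the kernel name, block size, and rtne rounding argument are illustrative choices, not taken from FBGEMM):

import torch
import triton
import triton.language as tl


@triton.jit
def _cast_fp8e4nv_kernel(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    # This cast is what trips "type fp8e4nv not supported in this
    # architecture" at compile time on pre-SM-8.9 GPUs.
    y = x.to(tl.float8e4nv, fp_downcast_rounding="rtne")
    tl.store(y_ptr + offs, y, mask=mask)


x = torch.randn(1024, device="cuda")
y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
_cast_fp8e4nv_kernel[(triton.cdiv(1024, 256),)](x, y, 1024, BLOCK=256)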
2025-05-07T20:33:06.1207411Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:06.1207891Z self=, 2025-05-07T20:33:06.1208319Z T=4096, 2025-05-07T20:33:06.1208507Z D=5120, 2025-05-07T20:33:06.1208708Z scale_ub=None, 2025-05-07T20:33:06.1208934Z contiguous=True, 2025-05-07T20:33:06.1209174Z compiled=True, 2025-05-07T20:33:06.1209383Z ) 2025-05-07T20:33:06.3640623Z W0507 20:33:06.360000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:33:06.3675140Z W0507 20:33:06.360000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:06.3676094Z W0507 20:33:06.360000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] def _fbgemm_silu_mul_quant( 2025-05-07T20:33:06.3676872Z W0507 20:33:06.360000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ^ 2025-05-07T20:33:06.3677949Z W0507 20:33:06.360000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:06.4340657Z W0507 20:33:06.431000 95972 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:33:06.7764629Z self = 
2025-05-07T20:33:06.7765159Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:33:06.7765459Z 
2025-05-07T20:33:06.7765583Z     @given(
2025-05-07T20:33:06.7765828Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:33:06.7766149Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:33:06.7766463Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:33:06.7766806Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:33:06.7767141Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:33:06.7767439Z     )
2025-05-07T20:33:06.7767826Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:33:06.7768304Z     def test_silu_mul_quant(
2025-05-07T20:33:06.7768553Z         self,
2025-05-07T20:33:06.7768754Z         T: int,
2025-05-07T20:33:06.7768943Z         D: int,
2025-05-07T20:33:06.7769173Z         scale_ub: Optional[float],
2025-05-07T20:33:06.7769451Z         contiguous: bool,
2025-05-07T20:33:06.7769699Z         compiled: bool,
2025-05-07T20:33:06.7769942Z     ) -> None:
2025-05-07T20:33:06.7770157Z         torch.manual_seed(2025)
2025-05-07T20:33:06.7770414Z 
2025-05-07T20:33:06.7770697Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:06.7771048Z 
2025-05-07T20:33:06.7771243Z         x_sign = torch.sign(x)
2025-05-07T20:33:06.7771539Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:33:06.7771856Z         x = x_sign * x_clamp
2025-05-07T20:33:06.7772100Z         x0 = x[:, :D]
2025-05-07T20:33:06.7772315Z         x1 = x[:, D:]
2025-05-07T20:33:06.7772521Z 
2025-05-07T20:33:06.7772708Z         if contiguous:
2025-05-07T20:33:06.7772940Z             x0 = x0.contiguous()
2025-05-07T20:33:06.7773199Z             x1 = x1.contiguous()
2025-05-07T20:33:06.7773442Z 
2025-05-07T20:33:06.7773640Z         if scale_ub is not None:
2025-05-07T20:33:06.7773911Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:33:06.7774249Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:33:06.7774678Z             )
2025-05-07T20:33:06.7774876Z         else:
2025-05-07T20:33:06.7775090Z             scale_ub_tensor = None
2025-05-07T20:33:06.7775349Z 
2025-05-07T20:33:06.7775583Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:06.7775899Z             op = silu_mul_quant
2025-05-07T20:33:06.7776153Z             if compiled:
2025-05-07T20:33:06.7776404Z                 op = torch.compile(op)
2025-05-07T20:33:06.7776704Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:06.7776984Z 
2025-05-07T20:33:06.7777181Z         y_fp8, y_scale = fn()
2025-05-07T20:33:06.7777467Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:33:06.7777768Z 
2025-05-07T20:33:06.7778005Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:06.7778501Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:33:06.7778814Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:33:06.7779146Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:33:06.7779638Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:06.7779956Z 
2025-05-07T20:33:06.7780160Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:33:06.7780359Z 
2025-05-07T20:33:06.7780464Z moe/activation_test.py:126: 
2025-05-07T20:33:06.7780766Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:06.7781114Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:33:06.7781453Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:06.7782274Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:33:06.7783080Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:33:06.7783655Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:06.7784376Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:06.7785100Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:33:06.7785865Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:33:06.7786643Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:33:06.7787316Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:33:06.7788175Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:33:06.7788732Z     fn()
2025-05-07T20:33:06.7789274Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:33:06.7789887Z     self.fn.run(
2025-05-07T20:33:06.7790387Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:06.7790956Z     kernel = self.compile(
2025-05-07T20:33:06.7791523Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:06.7792207Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:06.7792628Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:06.7792871Z 
2025-05-07T20:33:06.7793089Z self = 
2025-05-07T20:33:06.7794225Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:06.7795645Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9942710ae0>}
2025-05-07T20:33:06.7797060Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:33:06.7798151Z context = 
2025-05-07T20:33:06.7798452Z 
2025-05-07T20:33:06.7798629Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:06.7799170Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:06.7799660Z             module_map=module_map)
2025-05-07T20:33:06.7800035Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:06.7800494Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:33:06.7800766Z E       ^
2025-05-07T20:33:06.7801242Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:06.7801796Z 
2025-05-07T20:33:06.7802244Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:06.7802790Z 
2025-05-07T20:33:06.7802898Z Trying example: test_silu_mul_quant( self=, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True, )
2025-05-07T20:33:06.8069059Z W0507 20:33:06.805000 95972 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:33:06.8070369Z W0507 20:33:06.805000 95972 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:33:06.8071781Z W0507 20:33:06.805000 95972 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:33:06.8072821Z W0507 20:33:06.805000 95972 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:33:06.8073986Z W0507 20:33:06.805000 95972 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
2025-05-07T20:33:06.8958378Z self = 
2025-05-07T20:33:06.8958957Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:33:06.8973622Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:33:06.8973939Z moe/activation_test.py:126: 
2025-05-07T20:33:06.8998963Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:06.8999333Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:33:06.8999603Z E       ^
2025-05-07T20:33:06.9000084Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:06.9001001Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
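The recompile_limit warning above is a side effect of how the test sweeps its parameters rather than a separate bug: x0 = x[:, :D] is a view whose row stride stays 2*D, while x0.contiguous() is a fresh copy with row stride D, so toggling the contiguous flag flips the stride that torch.compile guards on ("expected 5120, actual 10240" is exactly D vs. 2*D for D=5120). After eight such recompiles Dynamo stops compiling silu_mul_quant for this frame. A standalone sketch of the layout difference (not FBGEMM code):

    import torch

    T, D = 128, 5120
    x = torch.randn(T, 2 * D, dtype=torch.bfloat16)
    x0 = x[:, :D]                     # view into x: row stride remains 2*D
    print(x0.stride())                # (10240, 1)
    print(x0.contiguous().stride())   # (5120, 1): densely packed copy
    # torch.compile guards compiled graphs on input strides, so alternating
    # these two layouts forces a recompile each time, until
    # torch._dynamo.config.recompile_limit (8 here) is reached.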
2025-05-07T20:33:06.9001673Z Trying example: test_silu_mul_quant( self=, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True, )
2025-05-07T20:33:07.0405003Z self = 
2025-05-07T20:33:07.0406083Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:33:07.0419063Z >       y_fp8, y_scale = fn()
2025-05-07T20:33:07.0419325Z moe/activation_test.py:117: 
2025-05-07T20:33:07.0434662Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:07.0435019Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:33:07.0435285Z E       ^
2025-05-07T20:33:07.0435755Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:07.0442156Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
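Because the failure is a hardware capability mismatch rather than a numerical bug, a suite like this would normally be skipped on unsupported devices instead of erroring example by example. A hedged sketch of such a guard (decorator placement and names are illustrative, not the actual FBGEMM test code):

    import unittest
    import torch

    def has_fp8e4nv() -> bool:
        # fp8e4nv requires an NVIDIA GPU with compute capability >= 8.9.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    class ActivationTests(unittest.TestCase):
        @unittest.skipIf(not has_fp8e4nv(), "fp8e4nv unsupported on this GPU")
        def test_silu_mul_quant(self) -> None:
            ...  # body as dumped in the log above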
2025-05-07T20:33:07.0442824Z Trying example: test_silu_mul_quant( self=, T=1, D=5120, scale_ub=None, contiguous=False, compiled=True, )
2025-05-07T20:33:07.1058836Z self = 
2025-05-07T20:33:07.1059882Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:33:07.1078084Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:33:07.1078526Z moe/activation_test.py:126: 
2025-05-07T20:33:07.1098103Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:07.1098464Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:33:07.1098732Z E       ^
2025-05-07T20:33:07.1099209Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:07.1100122Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
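For orientation: the ref_fn path above computes the SiLU gating in fp32 (x0 * sigmoid(x0) * x1) and then fails inside triton_quantize_fp8_row, i.e. while quantizing that product row-wise to fp8. An eager-mode sketch of the usual rowwise scheme (an assumption about the formulation; the actual FBGEMM kernel may differ in eps handling and in how scale_ub is applied):

    import torch

    def quantize_fp8_row_eager(y: torch.Tensor, scale_ub: torch.Tensor | None = None):
        fp8_max = torch.finfo(torch.float8_e4m3fn).max        # 448.0 for e4m3fn
        row_max = y.abs().amax(dim=1).clamp(min=1e-12)        # per-row amax
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)        # cap the dynamic range
        scale = row_max / fp8_max                             # dequantization scale
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

    # Round trip matches the test's dequant step:
    # y ~= y_fp8.to(torch.float32) * scale[:, None]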
2025-05-07T20:33:07.1100860Z Trying example: test_silu_mul_quant( self=, T=1, D=5120, scale_ub=None, contiguous=True, compiled=False, )
2025-05-07T20:33:07.2613895Z self = 
2025-05-07T20:33:07.2614505Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False
2025-05-07T20:33:07.2626556Z >       y_fp8, y_scale = fn()
2025-05-07T20:33:07.2626832Z moe/activation_test.py:117: 
2025-05-07T20:33:07.2641061Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:07.2641423Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:33:07.2641692Z E       ^
2025-05-07T20:33:07.2642166Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:07.2643080Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
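Note the two traceback shapes in this log: the fn() path fails while launching _fbgemm_silu_mul_quant directly, while the ref_fn() path dies one layer deeper, inside triton/runtime/autotuner.py, because an autotuned Triton kernel compiles and benchmarks every candidate config on its first launch (the do_bench frames in the T=4096 failure). A self-contained sketch of that behavior, with an illustrative kernel rather than FBGEMM's:

    import torch
    import triton
    import triton.language as tl

    @triton.autotune(
        configs=[
            triton.Config({"BLOCK": 128}, num_warps=4),
            triton.Config({"BLOCK": 256}, num_warps=4),
        ],
        key=["n"],
    )
    @triton.jit
    def _double(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        tl.store(y_ptr + offs, tl.load(x_ptr + offs, mask=mask) * 2.0, mask=mask)

    x = torch.randn(1024, device="cuda")
    y = torch.empty_like(x)
    # The first call compiles and times each config via do_bench; a compile
    # error in any config (e.g. an unsupported fp8 cast) surfaces right here,
    # which is why autotuner frames appear in the tracebacks above.
    _double[lambda meta: (triton.cdiv(1024, meta["BLOCK"]),)](x, y, 1024)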
2025-05-07T20:33:07.2643738Z Trying example: test_silu_mul_quant( self=, T=128, D=5120, scale_ub=None, contiguous=False, compiled=True, )
2025-05-07T20:33:07.2645934Z self = 
2025-05-07T20:33:07.2646462Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:33:07.2658680Z >       y_fp8, y_scale = fn()
2025-05-07T20:33:07.2658945Z moe/activation_test.py:117: 
2025-05-07T20:33:07.2674198Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:07.2674564Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:33:07.2674832Z E       ^
2025-05-07T20:33:07.2675308Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:07.2676264Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:07.2676917Z Trying example: test_silu_mul_quant( self=, T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False, )
2025-05-07T20:33:07.3829442Z self = 
2025-05-07T20:33:07.3836331Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False
2025-05-07T20:33:07.3848565Z >       y_fp8, y_scale = fn()
2025-05-07T20:33:07.3848838Z moe/activation_test.py:117: 
2025-05-07T20:33:07.3863168Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:07.3863531Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:33:07.3863799Z E       ^
2025-05-07T20:33:07.3864283Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:07.3865204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:07.3865853Z Trying example: test_silu_mul_quant( self=, T=128, D=5120, scale_ub=None, contiguous=False, compiled=False, )
2025-05-07T20:33:07.3868106Z self = 
2025-05-07T20:33:07.3868623Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False
2025-05-07T20:33:07.3880491Z >       y_fp8, y_scale = fn()
2025-05-07T20:33:07.3880747Z moe/activation_test.py:117: 
2025-05-07T20:33:07.3894955Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:07.3895326Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:33:07.3895592Z E       ^
2025-05-07T20:33:07.3896074Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:07.3897005Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:07.3897656Z Trying example: test_silu_mul_quant( self=, T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False, )
2025-05-07T20:33:07.5646650Z self = 
2025-05-07T20:33:07.5647195Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False
2025-05-07T20:33:07.5659488Z >       y_fp8, y_scale = fn()
2025-05-07T20:33:07.5659761Z moe/activation_test.py:117: 
2025-05-07T20:33:07.5673876Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:07.5674243Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:33:07.5674508Z E       ^
2025-05-07T20:33:07.5674982Z E       ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

2025-05-07T20:33:07.5675897Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
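Every failure in this run has the same root cause: the kernel asks Triton for the fp8e4nv element type (PyTorch's torch.float8_e4m3fn), and Triton only provides fp8e4nv on NVIDIA GPUs of compute capability 8.9 or newer (Ada/Hopper). On older parts it offers only the fp8e4b15 and fp8e5 encodings named in the ValueError, which is exactly what this runner's GPU reports. A minimal sketch of a guard that would skip these cases on unsupported hardware; the helper and decorator names are illustrative, not taken from the FBGEMM sources:

    import unittest

    import torch

    def gpu_supports_fp8e4nv() -> bool:
        # Triton maps torch.float8_e4m3fn to fp8e4nv only on SM 8.9+ parts;
        # anything older raises the CompilationError seen throughout this log.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical decorator for tests that require fp8e4nv support:
    requires_fp8e4nv = unittest.skipIf(
        not gpu_supports_fp8e4nv(), "fp8e4nv (float8_e4m3fn) requires SM 8.9+"
    )

Applied as @requires_fp8e4nv on test_silu_mul_quant, a guard like this would turn the repeated CompilationErrors below into a single skip.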
Hypothesis then tried further examples. The next two failed with the identical traceback, raised through torch/_dynamo/eval_frame.py because compiled=True:

2025-05-07T20:33:07.5676546Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:33:07.5709143Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)

The third sample, with scale_ub=None, got past fn(); the same ValueError then surfaced from the reference path instead, via triton_quantize_fp8_row:

2025-05-07T20:33:07.7175206Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
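This sample shows the problem is not specific to the fused kernel: the plain rowwise quantizer used by the reference trips the same architecture check. For orientation, the test's dequantization step (y = y_fp8.to(torch.float32) * y_scale[:, None]) implies a per-row scale of max|row| / FP8_MAX. Here is a hedged pure-PyTorch sketch of that scheme (the function name and the scale_ub clamping semantics are assumptions, not FBGEMM's implementation), which needs no Triton codegen:

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

    def quantize_fp8_row_sketch(
        y: torch.Tensor, scale_ub: torch.Tensor | None = None
    ) -> tuple[torch.Tensor, torch.Tensor]:
        # Per-row scale chosen so that y ~= y_fp8.float() * scale[:, None],
        # matching the dequantization done in the test above.
        row_max = y.abs().amax(dim=-1).float()
        if scale_ub is not None:
            # Assumed semantics: scale_ub caps the per-row max magnitude.
            row_max = torch.minimum(row_max, scale_ub.float())
        scale = row_max.clamp(min=1e-12) / FP8_MAX
        y_fp8 = (y.float() / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

The cast to torch.float8_e4m3fn is an ordinary PyTorch conversion, so a reference along these lines runs even on pre-SM-8.9 GPUs where the Triton kernels above cannot compile.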
The run continued through further samples, each failing in fn() with the same CompilationError from _fbgemm_silu_mul_quant; eager samples raise it straight from silu_mul_quant, compiled ones through torch/_dynamo/eval_frame.py. A standalone repro sketch follows this list:

2025-05-07T20:33:08.0083729Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:08.1664679Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:33:08.1697281Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:08.2624620Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
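Because every sample fails the same way, one of them is enough to reproduce the error without Hypothesis. A minimal repro sketch, assuming silu_mul_quant is importable from the module path shown in the traceback:

    import torch

    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    T, D = 128, 5120  # any of the sampled combinations triggers it
    x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    x0, x1 = x[:, :D].contiguous(), x[:, D:].contiguous()
    scale_ub = torch.tensor([1200.0], device="cuda", dtype=torch.float32)

    # Raises triton.compiler.errors.CompilationError on pre-SM-8.9 GPUs:
    y_fp8, y_scale = silu_mul_quant(x0, x1, scale_ub)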
The remaining samples behaved the same way, all failing in fn():

2025-05-07T20:33:08.3814952Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:33:08.3847401Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:33:08.3878747Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:33:08.5703772Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)

The last of these ended, like every fn() failure above, in:

E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.5741923Z 2025-05-07T20:33:08.5742475Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.5743031Z 2025-05-07T20:33:08.7229713Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.7230161Z self=, 2025-05-07T20:33:08.7230754Z T=4096, 2025-05-07T20:33:08.7230946Z D=5120, 2025-05-07T20:33:08.7231144Z scale_ub=1200.0, 2025-05-07T20:33:08.7231369Z contiguous=False, 2025-05-07T20:33:08.7231598Z compiled=False, 2025-05-07T20:33:08.7231810Z ) 2025-05-07T20:33:08.7232130Z self = 2025-05-07T20:33:08.7232652Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:08.7232939Z 2025-05-07T20:33:08.7233026Z @given( 2025-05-07T20:33:08.7233251Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.7233569Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.7233896Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.7234239Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.7234574Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.7234867Z ) 2025-05-07T20:33:08.7235229Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.7235677Z def test_silu_mul_quant( 2025-05-07T20:33:08.7235913Z self, 2025-05-07T20:33:08.7236105Z T: int, 2025-05-07T20:33:08.7236298Z D: int, 2025-05-07T20:33:08.7236512Z scale_ub: Optional[float], 2025-05-07T20:33:08.7236779Z contiguous: bool, 2025-05-07T20:33:08.7237018Z compiled: bool, 2025-05-07T20:33:08.7237248Z ) -> None: 2025-05-07T20:33:08.7237470Z torch.manual_seed(2025) 2025-05-07T20:33:08.7237709Z 2025-05-07T20:33:08.7237987Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.7238345Z 2025-05-07T20:33:08.7238544Z x_sign = torch.sign(x) 2025-05-07T20:33:08.7238852Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.7239167Z x = x_sign * x_clamp 2025-05-07T20:33:08.7239409Z x0 = x[:, :D] 2025-05-07T20:33:08.7239624Z x1 = x[:, D:] 2025-05-07T20:33:08.7239832Z 2025-05-07T20:33:08.7240020Z if contiguous: 2025-05-07T20:33:08.7240245Z x0 = x0.contiguous() 2025-05-07T20:33:08.7240507Z x1 = x1.contiguous() 2025-05-07T20:33:08.7240748Z 2025-05-07T20:33:08.7240932Z if scale_ub is not None: 2025-05-07T20:33:08.7241201Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.7241541Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.7241860Z ) 2025-05-07T20:33:08.7242061Z else: 2025-05-07T20:33:08.7242273Z scale_ub_tensor = None 2025-05-07T20:33:08.7242521Z 2025-05-07T20:33:08.7242758Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.7243087Z op = silu_mul_quant 2025-05-07T20:33:08.7243342Z if compiled: 2025-05-07T20:33:08.7243587Z op = torch.compile(op) 2025-05-07T20:33:08.7243891Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.7244173Z 2025-05-07T20:33:08.7244361Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.7244536Z 2025-05-07T20:33:08.7244634Z moe/activation_test.py:117: 2025-05-07T20:33:08.7244938Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.7245275Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.7245561Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.7246355Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:08.7247076Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.7247755Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.7248637Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.7249351Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.7249975Z kernel = self.compile( 2025-05-07T20:33:08.7250547Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.7251244Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.7251663Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.7251910Z 2025-05-07T20:33:08.7252122Z self = 2025-05-07T20:33:08.7253248Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.7254760Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f985783a2a0>} 2025-05-07T20:33:08.7256176Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.7257267Z context = 2025-05-07T20:33:08.7257571Z 2025-05-07T20:33:08.7257742Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.7258294Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.7258786Z module_map=module_map) 2025-05-07T20:33:08.7259156Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.7259525Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.7259792Z E ^ 2025-05-07T20:33:08.7260268Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.7260752Z 2025-05-07T20:33:08.7261189Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.7261738Z 2025-05-07T20:33:08.7261844Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.7262272Z self=, 2025-05-07T20:33:08.7262684Z T=4096, 2025-05-07T20:33:08.7262882Z D=5120, 2025-05-07T20:33:08.7263078Z scale_ub=1200.0, 2025-05-07T20:33:08.7263299Z contiguous=False, 2025-05-07T20:33:08.7263527Z compiled=True, 2025-05-07T20:33:08.7263733Z ) 2025-05-07T20:33:08.7264057Z self = 2025-05-07T20:33:08.7264578Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:08.7264870Z 2025-05-07T20:33:08.7264947Z @given( 2025-05-07T20:33:08.7265181Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.7265505Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.7265822Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.7266161Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.7266492Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.7266787Z ) 2025-05-07T20:33:08.7267146Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.7267658Z def test_silu_mul_quant( 2025-05-07T20:33:08.7267904Z self, 2025-05-07T20:33:08.7268105Z T: int, 2025-05-07T20:33:08.7268333Z D: int, 2025-05-07T20:33:08.7268575Z scale_ub: Optional[float], 2025-05-07T20:33:08.7268841Z contiguous: bool, 2025-05-07T20:33:08.7269154Z compiled: bool, 2025-05-07T20:33:08.7269377Z ) -> None: 2025-05-07T20:33:08.7269584Z torch.manual_seed(2025) 2025-05-07T20:33:08.7269827Z 2025-05-07T20:33:08.7270100Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.7270486Z 2025-05-07T20:33:08.7270672Z x_sign = torch.sign(x) 2025-05-07T20:33:08.7270958Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.7271267Z x = x_sign * x_clamp 2025-05-07T20:33:08.7271503Z x0 = x[:, :D] 2025-05-07T20:33:08.7271718Z x1 = x[:, D:] 2025-05-07T20:33:08.7271914Z 2025-05-07T20:33:08.7272100Z if contiguous: 2025-05-07T20:33:08.7272331Z x0 = x0.contiguous() 2025-05-07T20:33:08.7272586Z x1 = x1.contiguous() 2025-05-07T20:33:08.7272829Z 2025-05-07T20:33:08.7273015Z if scale_ub is not None: 2025-05-07T20:33:08.7273284Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.7273626Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.7273936Z ) 2025-05-07T20:33:08.7274129Z else: 2025-05-07T20:33:08.7274331Z scale_ub_tensor = None 2025-05-07T20:33:08.7274583Z 2025-05-07T20:33:08.7274817Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.7275127Z op = silu_mul_quant 2025-05-07T20:33:08.7275372Z if compiled: 2025-05-07T20:33:08.7275616Z op = torch.compile(op) 2025-05-07T20:33:08.7275910Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.7276200Z 2025-05-07T20:33:08.7276386Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.7276552Z 2025-05-07T20:33:08.7276653Z moe/activation_test.py:117: 2025-05-07T20:33:08.7276945Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.7277285Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.7277567Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.7278144Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:08.7278728Z return fn(*args, **kwargs) 
2025-05-07T20:33:08.7279460Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:08.7280185Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.7280738Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.7281445Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.7282135Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.7282686Z kernel = self.compile( 2025-05-07T20:33:08.7283240Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.7283926Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.7284332Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.7284566Z 2025-05-07T20:33:08.7284773Z self = 2025-05-07T20:33:08.7285896Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.7287314Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f985783a520>} 2025-05-07T20:33:08.7288922Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.7290011Z context = 2025-05-07T20:33:08.7290312Z 2025-05-07T20:33:08.7290482Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.7291065Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.7291550Z module_map=module_map) 2025-05-07T20:33:08.7291922Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.7292288Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.7292558Z E ^ 2025-05-07T20:33:08.7293035Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.7293513Z 2025-05-07T20:33:08.7293951Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.7294589Z 2025-05-07T20:33:08.8458169Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.8459055Z self=, 2025-05-07T20:33:08.8459952Z T=2048, 2025-05-07T20:33:08.8460333Z D=7168, 2025-05-07T20:33:08.8460724Z scale_ub=1200.0, 2025-05-07T20:33:08.8461168Z contiguous=False, 2025-05-07T20:33:08.8461621Z compiled=False, 2025-05-07T20:33:08.8462036Z ) 2025-05-07T20:33:08.8462678Z self = 2025-05-07T20:33:08.8463711Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:08.8464289Z 2025-05-07T20:33:08.8464459Z @given( 2025-05-07T20:33:08.8464912Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.8465552Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.8466179Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.8466844Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.8467521Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.8468075Z ) 2025-05-07T20:33:08.8468428Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.8468877Z def test_silu_mul_quant( 2025-05-07T20:33:08.8469120Z self, 2025-05-07T20:33:08.8469311Z T: int, 2025-05-07T20:33:08.8469498Z D: int, 2025-05-07T20:33:08.8469713Z scale_ub: Optional[float], 2025-05-07T20:33:08.8469985Z contiguous: bool, 2025-05-07T20:33:08.8470217Z compiled: bool, 2025-05-07T20:33:08.8470436Z ) -> None: 2025-05-07T20:33:08.8470643Z torch.manual_seed(2025) 2025-05-07T20:33:08.8470884Z 2025-05-07T20:33:08.8471160Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.8471512Z 2025-05-07T20:33:08.8471697Z x_sign = torch.sign(x) 2025-05-07T20:33:08.8471988Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.8472308Z x = x_sign * x_clamp 2025-05-07T20:33:08.8472541Z x0 = x[:, :D] 2025-05-07T20:33:08.8472751Z x1 = x[:, D:] 2025-05-07T20:33:08.8472958Z 2025-05-07T20:33:08.8473137Z if contiguous: 2025-05-07T20:33:08.8473368Z x0 = x0.contiguous() 2025-05-07T20:33:08.8473630Z x1 = x1.contiguous() 2025-05-07T20:33:08.8473876Z 2025-05-07T20:33:08.8474061Z if scale_ub is not None: 2025-05-07T20:33:08.8474338Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.8474670Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.8474970Z ) 2025-05-07T20:33:08.8475267Z else: 2025-05-07T20:33:08.8475473Z scale_ub_tensor = None 2025-05-07T20:33:08.8475720Z 2025-05-07T20:33:08.8475952Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.8476263Z op = silu_mul_quant 2025-05-07T20:33:08.8476505Z if compiled: 2025-05-07T20:33:08.8476869Z op = torch.compile(op) 2025-05-07T20:33:08.8477172Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.8477447Z 2025-05-07T20:33:08.8477634Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.8477799Z 2025-05-07T20:33:08.8477957Z moe/activation_test.py:117: 2025-05-07T20:33:08.8478252Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.8478589Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.8478871Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.8479591Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:08.8480315Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.8480869Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.8481583Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.8482284Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.8482833Z kernel = self.compile( 2025-05-07T20:33:08.8483389Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.8484080Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.8484482Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.8484725Z 2025-05-07T20:33:08.8484936Z self = 2025-05-07T20:33:08.8486056Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.8487480Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f985783bec0>} 2025-05-07T20:33:08.8488881Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.8489956Z context = 2025-05-07T20:33:08.8490257Z 2025-05-07T20:33:08.8490423Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.8490957Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.8491437Z module_map=module_map) 2025-05-07T20:33:08.8491800Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.8492154Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.8492417Z E ^ 2025-05-07T20:33:08.8492890Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.8493365Z 2025-05-07T20:33:08.8493799Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.8494463Z 2025-05-07T20:33:08.8494563Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.8494979Z self=, 2025-05-07T20:33:08.8495381Z T=1, 2025-05-07T20:33:08.8495562Z D=7168, 2025-05-07T20:33:08.8495752Z scale_ub=None, 2025-05-07T20:33:08.8495957Z contiguous=True, 2025-05-07T20:33:08.8496247Z compiled=False, 2025-05-07T20:33:08.8496448Z ) 2025-05-07T20:33:08.8496761Z self = 2025-05-07T20:33:08.8497258Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:08.8497528Z 2025-05-07T20:33:08.8497680Z @given( 2025-05-07T20:33:08.8497901Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:08.8498241Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:08.8498576Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:08.8498946Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:08.8499272Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:08.8499561Z ) 2025-05-07T20:33:08.8499911Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:08.8500357Z def test_silu_mul_quant( 2025-05-07T20:33:08.8500592Z self, 2025-05-07T20:33:08.8500782Z T: int, 2025-05-07T20:33:08.8500968Z D: int, 2025-05-07T20:33:08.8501182Z scale_ub: Optional[float], 2025-05-07T20:33:08.8501453Z contiguous: bool, 2025-05-07T20:33:08.8501684Z compiled: bool, 2025-05-07T20:33:08.8501904Z ) -> None: 2025-05-07T20:33:08.8502116Z torch.manual_seed(2025) 2025-05-07T20:33:08.8502355Z 2025-05-07T20:33:08.8502628Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:08.8502978Z 2025-05-07T20:33:08.8503169Z x_sign = torch.sign(x) 2025-05-07T20:33:08.8503456Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:08.8503768Z x = x_sign * x_clamp 2025-05-07T20:33:08.8504008Z x0 = x[:, :D] 2025-05-07T20:33:08.8504212Z x1 = x[:, D:] 2025-05-07T20:33:08.8504414Z 2025-05-07T20:33:08.8504597Z if contiguous: 2025-05-07T20:33:08.8504819Z x0 = x0.contiguous() 2025-05-07T20:33:08.8505075Z x1 = x1.contiguous() 2025-05-07T20:33:08.8505316Z 2025-05-07T20:33:08.8505499Z if scale_ub is not None: 2025-05-07T20:33:08.8505770Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:08.8506101Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:08.8506402Z ) 2025-05-07T20:33:08.8506602Z else: 2025-05-07T20:33:08.8506821Z scale_ub_tensor = None 2025-05-07T20:33:08.8507071Z 2025-05-07T20:33:08.8507313Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:08.8507634Z op = silu_mul_quant 2025-05-07T20:33:08.8507886Z if compiled: 2025-05-07T20:33:08.8514509Z op = torch.compile(op) 2025-05-07T20:33:08.8514831Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.8515116Z 2025-05-07T20:33:08.8515318Z > y_fp8, y_scale = fn() 2025-05-07T20:33:08.8515487Z 2025-05-07T20:33:08.8515597Z moe/activation_test.py:117: 2025-05-07T20:33:08.8515901Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.8516253Z moe/activation_test.py:115: in fn 2025-05-07T20:33:08.8516538Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:08.8517259Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:08.8517985Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:08.8518552Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:08.8519272Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:08.8519967Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:08.8520525Z kernel = self.compile( 2025-05-07T20:33:08.8521097Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:08.8521864Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:08.8522288Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:08.8522533Z 2025-05-07T20:33:08.8522747Z self = 2025-05-07T20:33:08.8523955Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:08.8525600Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9857c73240>} 2025-05-07T20:33:08.8527016Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:08.8528109Z context = 2025-05-07T20:33:08.8528410Z 2025-05-07T20:33:08.8528590Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:08.8529196Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:08.8529682Z module_map=module_map) 2025-05-07T20:33:08.8530061Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:08.8530436Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:08.8530712Z E ^ 2025-05-07T20:33:08.8531199Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:08.8531677Z 2025-05-07T20:33:08.8532122Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:08.8532668Z 2025-05-07T20:33:08.8532785Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:08.8533210Z self=, 2025-05-07T20:33:08.8533637Z T=16384, 2025-05-07T20:33:08.8533841Z D=7168, 2025-05-07T20:33:08.8534040Z scale_ub=1200.0, 2025-05-07T20:33:08.8534281Z contiguous=False, 2025-05-07T20:33:08.8534599Z compiled=True, 2025-05-07T20:33:09.0966620Z ) 2025-05-07T20:33:09.0967489Z self = 2025-05-07T20:33:09.0968965Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:09.0969403Z 2025-05-07T20:33:09.0969510Z @given( 2025-05-07T20:33:09.0969807Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:09.0970214Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:09.0970612Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:09.0970941Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:09.0971275Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:09.0971555Z ) 2025-05-07T20:33:09.0971902Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:09.0972357Z def test_silu_mul_quant( 2025-05-07T20:33:09.0972600Z self, 2025-05-07T20:33:09.0972792Z T: int, 2025-05-07T20:33:09.0972988Z D: int, 2025-05-07T20:33:09.0973200Z scale_ub: Optional[float], 2025-05-07T20:33:09.0973473Z contiguous: bool, 2025-05-07T20:33:09.0973708Z compiled: bool, 2025-05-07T20:33:09.0973929Z ) -> None: 2025-05-07T20:33:09.0974146Z torch.manual_seed(2025) 2025-05-07T20:33:09.0974492Z 2025-05-07T20:33:09.0974762Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:09.0975112Z 2025-05-07T20:33:09.0975301Z x_sign = torch.sign(x) 2025-05-07T20:33:09.0975587Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:09.0976044Z x = x_sign * x_clamp 2025-05-07T20:33:09.0976277Z x0 = x[:, :D] 2025-05-07T20:33:09.0976493Z x1 = x[:, D:] 2025-05-07T20:33:09.0976698Z 2025-05-07T20:33:09.0976875Z if contiguous: 2025-05-07T20:33:09.0977107Z x0 = x0.contiguous() 2025-05-07T20:33:09.0977492Z x1 = x1.contiguous() 2025-05-07T20:33:09.0977734Z 2025-05-07T20:33:09.0977924Z if scale_ub is not None: 2025-05-07T20:33:09.0978193Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:09.0978528Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:09.0978958Z ) 2025-05-07T20:33:09.0979149Z else: 2025-05-07T20:33:09.0979357Z scale_ub_tensor = None 2025-05-07T20:33:09.0979606Z 2025-05-07T20:33:09.0979834Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:09.0980151Z op = silu_mul_quant 2025-05-07T20:33:09.0980394Z if compiled: 2025-05-07T20:33:09.0980645Z op = torch.compile(op) 2025-05-07T20:33:09.0980942Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:09.0981214Z 2025-05-07T20:33:09.0981403Z > y_fp8, y_scale = fn() 2025-05-07T20:33:09.0981569Z 2025-05-07T20:33:09.0981668Z moe/activation_test.py:117: 2025-05-07T20:33:09.0981964Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:09.0982297Z moe/activation_test.py:115: in fn 2025-05-07T20:33:09.0982580Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:09.0983162Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:09.0983744Z return fn(*args, **kwargs) 
2025-05-07T20:33:09.0984427Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:09.0985146Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:09.0985698Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:09.0986413Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:09.0987102Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:09.0987658Z kernel = self.compile( 2025-05-07T20:33:09.0988214Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:09.0988897Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:09.0989303Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:09.0989539Z 2025-05-07T20:33:09.0989747Z self = 2025-05-07T20:33:09.0990860Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:09.0992289Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9857c71620>} 2025-05-07T20:33:09.0993695Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:09.0994777Z context = 2025-05-07T20:33:09.0995072Z 2025-05-07T20:33:09.0995239Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:09.0995775Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:09.0996250Z module_map=module_map) 2025-05-07T20:33:09.0996674Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:09.0997029Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:09.0997292Z E ^ 2025-05-07T20:33:09.0997766Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:09.0998344Z 2025-05-07T20:33:09.0998791Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:09.0999383Z 2025-05-07T20:33:09.0999486Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:09.0999948Z self=, 2025-05-07T20:33:09.1000361Z T=1, 2025-05-07T20:33:09.1000541Z D=7168, 2025-05-07T20:33:09.1000739Z scale_ub=None, 2025-05-07T20:33:09.1000959Z contiguous=False, 2025-05-07T20:33:09.1001181Z compiled=False, 2025-05-07T20:33:09.1001384Z ) 2025-05-07T20:33:09.1001707Z self = 2025-05-07T20:33:09.1002209Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:09.1002485Z 2025-05-07T20:33:09.1002564Z @given( 2025-05-07T20:33:09.1002794Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:09.1003118Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:09.1003428Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:09.1003764Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:09.1004101Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:09.1004391Z ) 2025-05-07T20:33:09.1004748Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:09.1005205Z def test_silu_mul_quant( 2025-05-07T20:33:09.1005449Z self, 2025-05-07T20:33:09.1005643Z T: int, 2025-05-07T20:33:09.1005842Z D: int, 2025-05-07T20:33:09.1006061Z scale_ub: Optional[float], 2025-05-07T20:33:09.1006342Z contiguous: bool, 2025-05-07T20:33:09.1006584Z compiled: bool, 2025-05-07T20:33:09.1006804Z ) -> None: 2025-05-07T20:33:09.1007025Z torch.manual_seed(2025) 2025-05-07T20:33:09.1007266Z 2025-05-07T20:33:09.1007549Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:09.1007899Z 2025-05-07T20:33:09.1008095Z x_sign = torch.sign(x) 2025-05-07T20:33:09.1008387Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:09.1008702Z x = x_sign * x_clamp 2025-05-07T20:33:09.1008992Z x0 = x[:, :D] 2025-05-07T20:33:09.1009214Z x1 = x[:, D:] 2025-05-07T20:33:09.1009421Z 2025-05-07T20:33:09.1009608Z if contiguous: 2025-05-07T20:33:09.1009845Z x0 = x0.contiguous() 2025-05-07T20:33:09.1010101Z x1 = x1.contiguous() 2025-05-07T20:33:09.1010348Z 2025-05-07T20:33:09.1010545Z if scale_ub is not None: 2025-05-07T20:33:09.1010821Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:09.1011164Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:09.1011476Z ) 2025-05-07T20:33:09.1011669Z else: 2025-05-07T20:33:09.1011877Z scale_ub_tensor = None 2025-05-07T20:33:09.1012126Z 2025-05-07T20:33:09.1012362Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:09.1012670Z op = silu_mul_quant 2025-05-07T20:33:09.1012916Z if compiled: 2025-05-07T20:33:09.1013156Z op = torch.compile(op) 2025-05-07T20:33:09.1013452Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:09.1013726Z 2025-05-07T20:33:09.1013914Z > y_fp8, y_scale = fn() 2025-05-07T20:33:09.1014077Z 2025-05-07T20:33:09.1014172Z moe/activation_test.py:117: 2025-05-07T20:33:09.1014544Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:09.1014884Z moe/activation_test.py:115: in fn 2025-05-07T20:33:09.1015212Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:09.1015925Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:09.1016651Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:09.1017281Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:09.1017997Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:09.1018698Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:09.1019293Z kernel = self.compile( 2025-05-07T20:33:09.1019861Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:09.1020554Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:09.1020961Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:09.1021207Z 2025-05-07T20:33:09.1021416Z self = 2025-05-07T20:33:09.1022544Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:09.1023968Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9857c73ba0>} 2025-05-07T20:33:09.1025374Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:09.1026790Z context = 2025-05-07T20:33:09.1027090Z 2025-05-07T20:33:09.1027260Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:09.1027790Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:09.1028263Z module_map=module_map) 2025-05-07T20:33:09.1028631Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:09.1028999Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:09.1029287Z E ^ 2025-05-07T20:33:09.1029757Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:09.1030228Z 2025-05-07T20:33:09.1030667Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:09.1031207Z 2025-05-07T20:33:09.1031309Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:09.1031727Z self=, 2025-05-07T20:33:09.1032139Z T=2048, 2025-05-07T20:33:09.1032319Z D=7168, 2025-05-07T20:33:09.1032511Z scale_ub=None, 2025-05-07T20:33:09.1032722Z contiguous=False, 2025-05-07T20:33:09.1032937Z compiled=True, 2025-05-07T20:33:09.1033129Z ) 2025-05-07T20:33:09.1913082Z self = 2025-05-07T20:33:09.1913883Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:09.1914280Z 2025-05-07T20:33:09.1914399Z @given( 2025-05-07T20:33:09.1914717Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:09.1915049Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:09.1915372Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:09.1915714Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:09.1916044Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:09.1916333Z ) 2025-05-07T20:33:09.1916855Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:09.1917344Z def test_silu_mul_quant( 2025-05-07T20:33:09.1917588Z self, 2025-05-07T20:33:09.1917775Z T: int, 2025-05-07T20:33:09.1917972Z D: int, 2025-05-07T20:33:09.1918195Z scale_ub: Optional[float], 2025-05-07T20:33:09.1918650Z contiguous: bool, 2025-05-07T20:33:09.1918894Z compiled: bool, 2025-05-07T20:33:09.1919119Z ) -> None: 2025-05-07T20:33:09.1919329Z torch.manual_seed(2025) 2025-05-07T20:33:09.1919574Z 2025-05-07T20:33:09.1919930Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:09.1920277Z 2025-05-07T20:33:09.1920469Z x_sign = torch.sign(x) 2025-05-07T20:33:09.1920762Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:09.1921078Z x = x_sign * x_clamp 2025-05-07T20:33:09.1921323Z x0 = x[:, :D] 2025-05-07T20:33:09.1921537Z x1 = x[:, D:] 2025-05-07T20:33:09.1921749Z 2025-05-07T20:33:09.1921944Z if contiguous: 2025-05-07T20:33:09.1922189Z x0 = x0.contiguous() 2025-05-07T20:33:09.1922461Z x1 = x1.contiguous() 2025-05-07T20:33:09.1922706Z 2025-05-07T20:33:09.1922904Z if scale_ub is not None: 2025-05-07T20:33:09.1923194Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:09.1923540Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:09.1923862Z ) 2025-05-07T20:33:09.1924067Z else: 2025-05-07T20:33:09.1924280Z scale_ub_tensor = None 2025-05-07T20:33:09.1924551Z 2025-05-07T20:33:09.1924791Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:09.1925104Z op = silu_mul_quant 2025-05-07T20:33:09.1925359Z if compiled: 2025-05-07T20:33:09.1925943Z op = torch.compile(op) 2025-05-07T20:33:09.1926243Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:09.1926527Z 2025-05-07T20:33:09.1926720Z > y_fp8, y_scale = fn() 2025-05-07T20:33:09.1926887Z 2025-05-07T20:33:09.1926989Z moe/activation_test.py:117: 2025-05-07T20:33:09.1927284Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:09.1927627Z moe/activation_test.py:115: in fn 2025-05-07T20:33:09.1927921Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:09.1928502Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:09.1929095Z return fn(*args, **kwargs) 
2025-05-07T20:33:09.1929785Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:09.1930508Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:09.1931060Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:09.1931773Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:09.1932472Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:09.1933023Z kernel = self.compile( 2025-05-07T20:33:09.1933590Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:09.1934282Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:09.1934846Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:09.1935087Z 2025-05-07T20:33:09.1935295Z self = 2025-05-07T20:33:09.1936417Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:09.1937947Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f99436e2ca0>} 2025-05-07T20:33:09.1939507Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:09.1940607Z context = 2025-05-07T20:33:09.1940910Z 2025-05-07T20:33:09.1941081Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:09.1941688Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:09.1942180Z module_map=module_map) 2025-05-07T20:33:09.1942554Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:09.1942918Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:09.1943193Z E ^ 2025-05-07T20:33:09.1943666Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:09.1944143Z 2025-05-07T20:33:09.1944588Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:09.1945142Z 2025-05-07T20:33:09.1945250Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:09.1945680Z self=, 2025-05-07T20:33:09.1946099Z T=4096, 2025-05-07T20:33:09.1946300Z D=7168, 2025-05-07T20:33:09.1946500Z scale_ub=None, 2025-05-07T20:33:09.1946718Z contiguous=False, 2025-05-07T20:33:09.1946952Z compiled=True, 2025-05-07T20:33:09.1947160Z ) 2025-05-07T20:33:09.1947491Z self = 2025-05-07T20:33:09.1948006Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:09.1948305Z 2025-05-07T20:33:09.1948385Z @given( 2025-05-07T20:33:09.1948618Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:09.1948936Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:09.1949256Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:09.1949604Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:09.1949942Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:09.1950242Z ) 2025-05-07T20:33:09.1950605Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:09.1951072Z def test_silu_mul_quant( 2025-05-07T20:33:09.1951315Z self, 2025-05-07T20:33:09.1951517Z T: int, 2025-05-07T20:33:09.1951722Z D: int, 2025-05-07T20:33:09.1951941Z scale_ub: Optional[float], 2025-05-07T20:33:09.1952220Z contiguous: bool, 2025-05-07T20:33:09.1952470Z compiled: bool, 2025-05-07T20:33:09.1952696Z ) -> None: 2025-05-07T20:33:09.1952916Z torch.manual_seed(2025) 2025-05-07T20:33:09.1953161Z 2025-05-07T20:33:09.1953437Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:09.1953799Z 2025-05-07T20:33:09.1954005Z x_sign = torch.sign(x) 2025-05-07T20:33:09.1954303Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:09.1954628Z x = x_sign * x_clamp 2025-05-07T20:33:09.1954876Z x0 = x[:, :D] 2025-05-07T20:33:09.1955091Z x1 = x[:, D:] 2025-05-07T20:33:09.1955306Z 2025-05-07T20:33:09.1955499Z if contiguous: 2025-05-07T20:33:09.1955731Z x0 = x0.contiguous() 2025-05-07T20:33:09.1956004Z x1 = x1.contiguous() 2025-05-07T20:33:09.1956260Z 2025-05-07T20:33:09.1956464Z if scale_ub is not None: 2025-05-07T20:33:09.1956739Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:09.1957086Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:09.1957458Z ) 2025-05-07T20:33:09.1957653Z else: 2025-05-07T20:33:09.1957864Z scale_ub_tensor = None 2025-05-07T20:33:09.1958123Z 2025-05-07T20:33:09.1958352Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:09.1958674Z op = silu_mul_quant 2025-05-07T20:33:09.1959034Z if compiled: 2025-05-07T20:33:09.1959280Z op = torch.compile(op) 2025-05-07T20:33:09.1959587Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:09.1959865Z 2025-05-07T20:33:09.1960051Z > y_fp8, y_scale = fn() 2025-05-07T20:33:09.1960263Z 2025-05-07T20:33:09.1960360Z moe/activation_test.py:117: 2025-05-07T20:33:09.1960660Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:09.1961003Z moe/activation_test.py:115: in fn 2025-05-07T20:33:09.1961286Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:09.1961869Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:09.1962464Z return fn(*args, **kwargs) 
2025-05-07T20:33:09.1963148Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:09.1963882Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:09.1964445Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:09.1965161Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:09.1965854Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:09.1966414Z kernel = self.compile( 2025-05-07T20:33:09.1966982Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:09.1967667Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:09.1968082Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:09.1968327Z 2025-05-07T20:33:09.1968536Z self = 2025-05-07T20:33:09.1969661Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:09.1971084Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9942a4f240>} 2025-05-07T20:33:09.1972681Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:09.1973769Z context = 2025-05-07T20:33:09.1974073Z 2025-05-07T20:33:09.1974242Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:09.1974937Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:09.1975422Z module_map=module_map) 2025-05-07T20:33:09.1975794Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:09.1976155Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:09.1976434Z E ^ 2025-05-07T20:33:09.1977002Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:09.1977483Z 2025-05-07T20:33:09.1977922Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:09.1978481Z 2025-05-07T20:33:09.3588843Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:09.3589680Z self=, 2025-05-07T20:33:09.3590391Z T=16384, 2025-05-07T20:33:09.3590721Z D=5120, 2025-05-07T20:33:09.3591206Z scale_ub=1200.0, 2025-05-07T20:33:09.3599176Z contiguous=False, 2025-05-07T20:33:09.3599418Z compiled=False, 2025-05-07T20:33:09.3599790Z ) 2025-05-07T20:33:09.3600159Z self = 2025-05-07T20:33:09.3600750Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:09.3601143Z 2025-05-07T20:33:09.3601225Z @given( 2025-05-07T20:33:09.3601479Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:09.3601835Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:09.3602176Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:09.3602556Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:09.3602932Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:09.3603253Z ) 2025-05-07T20:33:09.3603660Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:09.3604185Z def test_silu_mul_quant( 2025-05-07T20:33:09.3604454Z self, 2025-05-07T20:33:09.3604661Z T: int, 2025-05-07T20:33:09.3604883Z D: int, 2025-05-07T20:33:09.3605124Z scale_ub: Optional[float], 2025-05-07T20:33:09.3605423Z contiguous: bool, 2025-05-07T20:33:09.3605690Z compiled: bool, 2025-05-07T20:33:09.3605932Z ) -> None: 2025-05-07T20:33:09.3606147Z torch.manual_seed(2025) 2025-05-07T20:33:09.3606398Z 2025-05-07T20:33:09.3606680Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:09.3607032Z 2025-05-07T20:33:09.3607231Z x_sign = torch.sign(x) 2025-05-07T20:33:09.3607530Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:09.3607842Z x = x_sign * x_clamp 2025-05-07T20:33:09.3608091Z x0 = x[:, :D] 2025-05-07T20:33:09.3608310Z x1 = x[:, D:] 2025-05-07T20:33:09.3608513Z 2025-05-07T20:33:09.3608699Z if contiguous: 2025-05-07T20:33:09.3608937Z x0 = x0.contiguous() 2025-05-07T20:33:09.3609205Z x1 = x1.contiguous() 2025-05-07T20:33:09.3609444Z 2025-05-07T20:33:09.3609646Z if scale_ub is not None: 2025-05-07T20:33:09.3609921Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:09.3610258Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:09.3610578Z ) 2025-05-07T20:33:09.3610767Z else: 2025-05-07T20:33:09.3610976Z scale_ub_tensor = None 2025-05-07T20:33:09.3611233Z 2025-05-07T20:33:09.3611472Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:09.3611788Z op = silu_mul_quant 2025-05-07T20:33:09.3612046Z if compiled: 2025-05-07T20:33:09.3612302Z op = torch.compile(op) 2025-05-07T20:33:09.3612602Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:09.3612888Z 2025-05-07T20:33:09.3613085Z > y_fp8, y_scale = fn() 2025-05-07T20:33:09.3613251Z 2025-05-07T20:33:09.3613351Z moe/activation_test.py:117: 2025-05-07T20:33:09.3613656Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:09.3614003Z moe/activation_test.py:115: in fn 2025-05-07T20:33:09.3614297Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:09.3615189Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:09.3615925Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:33:09.3616490Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:09.3617206Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:09.3617908Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:09.3618538Z     kernel = self.compile(
2025-05-07T20:33:09.3619112Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:09.3619882Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:09.3620308Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:09.3620559Z 
2025-05-07T20:33:09.3620776Z self = <triton.compiler.compiler.ASTSource object at 0x...>
2025-05-07T20:33:09.3621953Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:09.3623384Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f9942a4e840>}
2025-05-07T20:33:09.3624802Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:33:09.3626196Z context = <...>
2025-05-07T20:33:09.3626498Z 
2025-05-07T20:33:09.3626678Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:09.3627222Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:09.3627706Z                            module_map=module_map)
2025-05-07T20:33:09.3628102Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:09.3628468Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:33:09.3628735Z E       ^
2025-05-07T20:33:09.3629209Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:09.3629690Z 
2025-05-07T20:33:09.3630126Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:09.3630674Z 
2025-05-07T20:33:09.3630777Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:09.3631201Z     self=<...>,
2025-05-07T20:33:09.3631614Z     T=16384,
2025-05-07T20:33:09.3631813Z     D=5120,
2025-05-07T20:33:09.3632008Z     scale_ub=1200.0,
2025-05-07T20:33:09.3632231Z     contiguous=True,
2025-05-07T20:33:09.3632459Z     compiled=True,
2025-05-07T20:33:09.3632664Z )
2025-05-07T20:33:09.3632986Z self = <...>
2025-05-07T20:33:09.3633498Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:33:09.3633783Z 
2025-05-07T20:33:09.3633869Z     @given(
2025-05-07T20:33:09.3634092Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:33:09.3634426Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:33:09.3634744Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:33:09.3635082Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:33:09.3635414Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:33:09.3635706Z     )
2025-05-07T20:33:09.3636064Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:33:09.3636514Z     def test_silu_mul_quant(
2025-05-07T20:33:09.3636766Z         self,
2025-05-07T20:33:09.3636966Z         T: int,
2025-05-07T20:33:09.3637160Z         D: int,
2025-05-07T20:33:09.3637384Z         scale_ub: Optional[float],
2025-05-07T20:33:09.3637660Z         contiguous: bool,
2025-05-07T20:33:09.3637898Z         compiled: bool,
2025-05-07T20:33:09.3638127Z     ) -> None:
2025-05-07T20:33:09.3638349Z         torch.manual_seed(2025)
2025-05-07T20:33:09.3638590Z 
2025-05-07T20:33:09.3638963Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:09.3639331Z 
2025-05-07T20:33:09.3639537Z         x_sign = torch.sign(x)
2025-05-07T20:33:09.3639840Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:33:09.3640172Z         x = x_sign * x_clamp
2025-05-07T20:33:09.3640544Z         x0 = x[:, :D]
2025-05-07T20:33:09.3640769Z         x1 = x[:, D:]
2025-05-07T20:33:09.3640988Z 
2025-05-07T20:33:09.3641188Z         if contiguous:
2025-05-07T20:33:09.3641422Z             x0 = x0.contiguous()
2025-05-07T20:33:09.3641780Z             x1 = x1.contiguous()
2025-05-07T20:33:09.3642032Z 
2025-05-07T20:33:09.3642226Z         if scale_ub is not None:
2025-05-07T20:33:09.3642512Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:33:09.3642857Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:33:09.3643169Z             )
2025-05-07T20:33:09.3643372Z         else:
2025-05-07T20:33:09.3643589Z             scale_ub_tensor = None
2025-05-07T20:33:09.3643852Z 
2025-05-07T20:33:09.3644092Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:09.3644424Z             op = silu_mul_quant
2025-05-07T20:33:09.3644676Z             if compiled:
2025-05-07T20:33:09.3644931Z                 op = torch.compile(op)
2025-05-07T20:33:09.3645245Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:09.3645534Z 
2025-05-07T20:33:09.3645729Z >       y_fp8, y_scale = fn()
2025-05-07T20:33:09.3645905Z 
2025-05-07T20:33:09.3646009Z moe/activation_test.py:117: 
2025-05-07T20:33:09.3646315Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:09.3646657Z moe/activation_test.py:115: in fn
2025-05-07T20:33:09.3646954Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:09.3647545Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:33:09.3648135Z     return fn(*args, **kwargs)
2025-05-07T20:33:09.3648837Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:33:09.3649575Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:33:09.3650157Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:09.3650879Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:09.3651589Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:09.3653652Z     kernel = self.compile(
2025-05-07T20:33:09.3654230Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:09.3655035Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:09.3655461Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:09.3655705Z 
2025-05-07T20:33:09.3655929Z self = <triton.compiler.compiler.ASTSource object at 0x...>
2025-05-07T20:33:09.3657068Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:09.3658502Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f9943fc6ca0>}
2025-05-07T20:33:09.3659924Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:33:09.3661016Z context = <...>
2025-05-07T20:33:09.3661317Z 
2025-05-07T20:33:09.3661497Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:09.3662093Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:09.3662585Z                            module_map=module_map)
2025-05-07T20:33:09.3663046Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:09.3663418Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:33:09.3663683Z E       ^
2025-05-07T20:33:09.3664170Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:09.3664685Z 
2025-05-07T20:33:09.3665130Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:09.3665672Z 
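Every sampled example aborts at the same point: Triton refuses to lower the _fbgemm_silu_mul_quant kernel because fp8e4nv (the FP8 E4M3 format behind torch.float8_e4m3fn) is only implemented for NVIDIA GPUs with compute capability 8.9 or newer (Ada/Hopper); older parts expose only fp8e4b15 and fp8e5, exactly as the ValueError says. The failure is raised at kernel compile time, before any tensor data is touched. A minimal sketch of a capability guard that would skip these cases on unsupported hardware follows; the helper name and decorator placement are illustrative assumptions, not FBGEMM's actual test code.

# Sketch only: skip FP8 E4M3 tests on GPUs older than SM 8.9.
# supports_fp8_e4m3 is a hypothetical helper, not part of FBGEMM.
import unittest

import torch


def supports_fp8_e4m3() -> bool:
    # Triton lowers fp8e4nv (torch.float8_e4m3fn) only on compute
    # capability >= 8.9, e.g. L4 (sm_89) or H100 (sm_90).
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


# Applied to the failing test, something like:
# @unittest.skipUnless(supports_fp8_e4m3(), "FP8 E4M3 requires SM 8.9+")
# def test_silu_mul_quant(self, ...) -> None:
#     ...

Hypothesis goes on to retry further sampled parameter combinations, every one of which fails with this same CompilationError: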
2025-05-07T20:33:09.5382476Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:09.5383830Z     self=<...>,
2025-05-07T20:33:09.5384755Z     T=16384,
2025-05-07T20:33:09.5385154Z     D=5120,
2025-05-07T20:33:09.5385521Z     scale_ub=None,
2025-05-07T20:33:09.5385947Z     contiguous=False,
2025-05-07T20:33:09.5386400Z     compiled=True,
2025-05-07T20:33:09.5386790Z )
2025-05-07T20:33:09.5418838Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:09.5419253Z     self=<...>,
2025-05-07T20:33:09.5419668Z     T=2048,
2025-05-07T20:33:09.5419867Z     D=5120,
2025-05-07T20:33:09.5420066Z     scale_ub=None,
2025-05-07T20:33:09.5420283Z     contiguous=False,
2025-05-07T20:33:09.5420514Z     compiled=True,
2025-05-07T20:33:09.5420719Z )
2025-05-07T20:33:09.6377158Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:09.6377589Z     self=<...>,
2025-05-07T20:33:09.6378016Z     T=2048,
2025-05-07T20:33:09.6378215Z     D=5120,
2025-05-07T20:33:09.6378409Z     scale_ub=1200.0,
2025-05-07T20:33:09.6378629Z     contiguous=False,
2025-05-07T20:33:09.6378858Z     compiled=True,
2025-05-07T20:33:09.6379064Z )
2025-05-07T20:33:09.8174323Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:09.8174800Z     self=<...>,
2025-05-07T20:33:09.8175331Z     T=4096,
2025-05-07T20:33:09.8175525Z     D=5120,
2025-05-07T20:33:09.8175714Z     scale_ub=1200.0,
2025-05-07T20:33:09.8175931Z     contiguous=True,
2025-05-07T20:33:09.8176148Z     compiled=True,
2025-05-07T20:33:09.8176356Z )
2025-05-07T20:33:09.8207072Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:09.8207503Z     self=<...>,
2025-05-07T20:33:09.8207918Z     T=128,
2025-05-07T20:33:09.8208112Z     D=5120,
2025-05-07T20:33:09.8208302Z     scale_ub=1200.0,
2025-05-07T20:33:09.8208519Z     contiguous=False,
2025-05-07T20:33:09.8208747Z     compiled=True,
2025-05-07T20:33:09.8208951Z )
2025-05-07T20:33:10.1391538Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:10.1391966Z     self=<...>,
2025-05-07T20:33:10.1392387Z     T=16384,
2025-05-07T20:33:10.1392571Z     D=7168,
2025-05-07T20:33:10.1392764Z     scale_ub=1200.0,
2025-05-07T20:33:10.1392980Z     contiguous=True,
2025-05-07T20:33:10.1393196Z     compiled=True,
2025-05-07T20:33:10.1393398Z )
2025-05-07T20:33:10.2660339Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:10.2660803Z     self=<...>,
2025-05-07T20:33:10.2661237Z     T=16384,
2025-05-07T20:33:10.2661526Z     D=5120,
2025-05-07T20:33:10.2661791Z     scale_ub=1200.0,
2025-05-07T20:33:10.2662259Z     contiguous=True,
2025-05-07T20:33:10.2662554Z     compiled=False,
2025-05-07T20:33:10.2662819Z )
2025-05-07T20:33:10.2699250Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:10.2699680Z     self=<...>,
2025-05-07T20:33:10.2700096Z     T=1,
2025-05-07T20:33:10.2700290Z     D=7168,
2025-05-07T20:33:10.2700483Z     scale_ub=1200.0,
2025-05-07T20:33:10.2700702Z     contiguous=False,
2025-05-07T20:33:10.2700927Z     compiled=False,
2025-05-07T20:33:10.2701129Z )
2025-05-07T20:33:10.4478664Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:10.4479097Z     self=<...>,
2025-05-07T20:33:10.4479945Z     T=4096,
2025-05-07T20:33:10.4480462Z     D=7168,
2025-05-07T20:33:10.4480882Z     scale_ub=1200.0,
2025-05-07T20:33:10.4481342Z     contiguous=False,
2025-05-07T20:33:10.4481827Z     compiled=True,
2025-05-07T20:33:10.4482221Z )
2025-05-07T20:33:10.4516421Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:10.4516851Z     self=<...>,
2025-05-07T20:33:10.4517260Z     T=128,
2025-05-07T20:33:10.4517446Z     D=7168,
2025-05-07T20:33:10.4517639Z     scale_ub=1200.0,
2025-05-07T20:33:10.4517862Z     contiguous=False,
2025-05-07T20:33:10.4518097Z     compiled=True,
2025-05-07T20:33:10.4518307Z )
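Because the error is raised while compiling the kernel, the sampled T, D, scale_ub, contiguous, and compiled values never influence the outcome. For debugging on hardware without FP8 E4M3, a plain-PyTorch stand-in for what the test appears to exercise can be useful; the sketch below assumes silu_mul_quant computes silu(x0) * x1 and quantizes the result with a single tensorwise scale capped by scale_ub, which is an inference from the test body, not FBGEMM's documented contract.

# Rough reference only; assumes tensorwise FP8 E4M3 quantization.
from typing import Optional, Tuple

import torch


def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # silu(x0) * x1, computed in float32 for accuracy.
    y = torch.nn.functional.silu(x0.float()) * x1.float()
    # One scale for the whole tensor, optionally capped by scale_ub.
    amax = y.abs().amax()
    if scale_ub is not None:
        amax = torch.minimum(amax, scale_ub.to(amax.dtype).squeeze())
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0
    scale = (amax / fp8_max).clamp(min=1e-12)
    y_fp8 = (y / scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
    return y_fp8, scale

The last example the shrinker tried before this point in the log: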
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:10.5467092Z 
2025-05-07T20:33:10.5467535Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:10.5468084Z 
2025-05-07T20:33:10.5468192Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True) -> fails identically: silu_mul_quant (fbgemm_gpu/experimental/gen_ai/moe/activation.py:80) launches _fbgemm_silu_mul_quant and Triton raises triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:10.6100871Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:95 (x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)): CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:33:10.6115104Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> torch.OutOfMemoryError at moe/activation_test.py:95: tried to allocate 112.00 MiB with 28.44 MiB free (21.61 GiB allocated by PyTorch)
2025-05-07T20:33:10.6136509Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn([T, 2 * D], ...)): tried to allocate 448.00 MiB with 140.44 MiB free (21.50 GiB allocated by PyTorch)
2025-05-07T20:33:10.6149175Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> torch.OutOfMemoryError at moe/activation_test.py:95: tried to allocate 56.00 MiB with 28.44 MiB free (21.67 GiB allocated by PyTorch)
2025-05-07T20:33:10.6162514Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:94 (x_sign = torch.sign(x)): tried to allocate 56.00 MiB with 28.44 MiB free (21.67 GiB allocated by PyTorch)
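Alongside the compile failures, the run now also hits torch.OutOfMemoryError, and the "allocated by PyTorch" figure climbs steadily across examples (21.50 -> 21.60 -> 21.67 GiB and onward): memory from earlier failed examples is evidently not released before Hypothesis tries the next one, until even a 40-56 MiB request fails on the 22.07 GiB card. The error text's own suggestion can be applied from the job environment; a sketch, with the caveat that the variable must be set before the first CUDA allocation and that it mitigates fragmentation rather than genuine exhaustion:

    # e.g. in conftest.py or the CI job environment, before CUDA is
    # initialized; a mitigation sketch, not a change the suite makes.
    import os
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")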
2025-05-07T20:33:10.7315684Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> triton.compiler.errors.CompilationError from _fbgemm_silu_mul_quant: fp8e4nv not supported in this architecture
2025-05-07T20:33:10.7347600Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False) -> triton.compiler.errors.CompilationError from _fbgemm_silu_mul_quant: fp8e4nv not supported in this architecture
2025-05-07T20:33:10.8058765Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False) -> triton.compiler.errors.CompilationError from _fbgemm_silu_mul_quant: fp8e4nv not supported in this architecture
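Whether an example dies with CompilationError or OutOfMemoryError depends only on whether it manages to allocate its inputs first. A more direct remedy for the accumulation than the allocator flag would be to release CUDA memory between examples; a hedged sketch of a per-test teardown (the helper name is invented; gc.collect and torch.cuda.empty_cache are real APIs):

    import gc
    import torch

    def release_cuda_memory() -> None:
        # Collect unreachable Python objects first so their CUDA storages
        # are freed, then return the allocator's cached blocks to the driver.
        gc.collect()
        torch.cuda.empty_cache()

    # e.g. in the TestCase:
    # def tearDown(self) -> None:
    #     release_cuda_memory()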
2025-05-07T20:33:10.8090604Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn([T, 2 * D], ...)): tried to allocate 56.00 MiB with 26.44 MiB free (21.69 GiB allocated by PyTorch)
2025-05-07T20:33:10.8918982Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> triton.compiler.errors.CompilationError from _fbgemm_silu_mul_quant: fp8e4nv not supported in this architecture
2025-05-07T20:33:10.8951221Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:94 (x_sign = torch.sign(x)): tried to allocate 40.00 MiB with 26.44 MiB free (21.73 GiB allocated by PyTorch)
2025-05-07T20:33:10.8964511Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92: tried to allocate 320.00 MiB with 26.44 MiB free (21.73 GiB allocated by PyTorch)
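To debug any one of these outside the property-based loop, the failing example can be replayed directly from its printed parameters; a sketch assuming silu_mul_quant is importable from the fbgemm_gpu.experimental.gen_ai.moe.activation module named in the traceback:

    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    # Parameters copied from one failing example above:
    # T=128, D=5120, scale_ub=None, contiguous=True, compiled=False.
    torch.manual_seed(2025)
    T, D = 128, 5120
    x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    x = torch.sign(x) * torch.clamp(torch.abs(x), 0.01, 2.0)
    x0, x1 = x[:, :D].contiguous(), x[:, D:].contiguous()
    # On this runner the Triton kernel fails to compile (fp8e4nv unsupported):
    y_fp8, y_scale = silu_mul_quant(x0, x1, None)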
2025-05-07T20:33:10.9737996Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92: tried to allocate 80.00 MiB with 26.44 MiB free (21.73 GiB allocated by PyTorch)
2025-05-07T20:33:10.9750718Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92: tried to allocate 40.00 MiB with 26.44 MiB free (21.73 GiB allocated by PyTorch)
2025-05-07T20:33:10.9763386Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=True) -> torch.OutOfMemoryError at moe/activation_test.py:92: tried to allocate 112.00 MiB with 26.44 MiB free (21.73 GiB allocated by PyTorch)
2025-05-07T20:33:10.9776258Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> torch.OutOfMemoryError at moe/activation_test.py:92: tried to allocate 40.00 MiB with 26.44 MiB free (21.73 GiB allocated by PyTorch)
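By this point every example dies at the initial torch.randn, and the requested sizes match a [T, 2*D] bfloat16 tensor (2 bytes per element) exactly; a quick arithmetic check:

    # Size in MiB of torch.randn([T, 2 * D], dtype=torch.bfloat16).
    def randn_alloc_mib(T: int, D: int) -> float:
        return T * 2 * D * 2 / 1024**2

    assert randn_alloc_mib(16384, 7168) == 448.0  # "Tried to allocate 448.00 MiB"
    assert randn_alloc_mib(4096, 7168) == 112.0   # "Tried to allocate 112.00 MiB"
    assert randn_alloc_mib(2048, 5120) == 40.0    # "Tried to allocate 40.00 MiB"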
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:10.9788611Z 2025-05-07T20:33:10.9788733Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:10.9788983Z 2025-05-07T20:33:10.9789112Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.9789536Z self=, 2025-05-07T20:33:10.9789957Z T=4096, 2025-05-07T20:33:10.9790144Z D=7168, 2025-05-07T20:33:10.9790333Z scale_ub=1200.0, 2025-05-07T20:33:10.9790546Z contiguous=True, 2025-05-07T20:33:10.9790758Z compiled=False, 2025-05-07T20:33:10.9790955Z ) 2025-05-07T20:33:11.0858249Z self = 2025-05-07T20:33:11.0858938Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:11.0859406Z 2025-05-07T20:33:11.0859527Z @given( 2025-05-07T20:33:11.0859826Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.0860261Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.0860666Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.0860997Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.0861319Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.0861605Z ) 2025-05-07T20:33:11.0861952Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.0862399Z def test_silu_mul_quant( 2025-05-07T20:33:11.0862635Z self, 2025-05-07T20:33:11.0862824Z T: int, 2025-05-07T20:33:11.0863006Z D: int, 2025-05-07T20:33:11.0863214Z scale_ub: Optional[float], 2025-05-07T20:33:11.0863485Z contiguous: bool, 2025-05-07T20:33:11.0863717Z compiled: bool, 2025-05-07T20:33:11.0863939Z ) -> None: 2025-05-07T20:33:11.0864143Z torch.manual_seed(2025) 2025-05-07T20:33:11.0864373Z 2025-05-07T20:33:11.0864640Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.0866822Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:11.0868826Z 2025-05-07T20:33:11.0868940Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:11.0869160Z 2025-05-07T20:33:11.0869267Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.0869683Z self=, 2025-05-07T20:33:11.0870092Z T=16384, 2025-05-07T20:33:11.0870284Z D=7168, 2025-05-07T20:33:11.0870469Z scale_ub=None, 2025-05-07T20:33:11.0870683Z contiguous=False, 2025-05-07T20:33:11.0870901Z compiled=True, 2025-05-07T20:33:11.0871091Z ) 2025-05-07T20:33:11.0871403Z self = 2025-05-07T20:33:11.0871906Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:11.0872188Z 2025-05-07T20:33:11.0872378Z @given( 2025-05-07T20:33:11.0872598Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.0872909Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.0873216Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.0873657Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.0873991Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.0874276Z ) 2025-05-07T20:33:11.0874615Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.0875125Z def test_silu_mul_quant( 2025-05-07T20:33:11.0875364Z self, 2025-05-07T20:33:11.0875555Z T: int, 2025-05-07T20:33:11.0875739Z D: int, 2025-05-07T20:33:11.0875949Z scale_ub: Optional[float], 2025-05-07T20:33:11.0876216Z contiguous: bool, 2025-05-07T20:33:11.0876444Z compiled: bool, 2025-05-07T20:33:11.0876654Z ) -> None: 2025-05-07T20:33:11.0876870Z torch.manual_seed(2025) 2025-05-07T20:33:11.0877104Z 2025-05-07T20:33:11.0877374Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.0879558Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:11.0881551Z 2025-05-07T20:33:11.0881670Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:11.0881882Z 2025-05-07T20:33:11.0881984Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.0882393Z self=, 2025-05-07T20:33:11.0882808Z T=4096, 2025-05-07T20:33:11.0882987Z D=7168, 2025-05-07T20:33:11.0883165Z scale_ub=None, 2025-05-07T20:33:11.0883373Z contiguous=True, 2025-05-07T20:33:11.0883586Z compiled=False, 2025-05-07T20:33:11.0883779Z ) 2025-05-07T20:33:11.0884098Z self = 2025-05-07T20:33:11.0884601Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:11.0884881Z 2025-05-07T20:33:11.0884960Z @given( 2025-05-07T20:33:11.0885185Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.0885498Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.0893065Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.0893531Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.0893868Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.0894167Z ) 2025-05-07T20:33:11.0894625Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.0895083Z def test_silu_mul_quant( 2025-05-07T20:33:11.0895329Z self, 2025-05-07T20:33:11.0895519Z T: int, 2025-05-07T20:33:11.0895710Z D: int, 2025-05-07T20:33:11.0895935Z scale_ub: Optional[float], 2025-05-07T20:33:11.0896204Z contiguous: bool, 2025-05-07T20:33:11.0896440Z compiled: bool, 2025-05-07T20:33:11.0896661Z ) -> None: 2025-05-07T20:33:11.0896881Z torch.manual_seed(2025) 2025-05-07T20:33:11.0897136Z 2025-05-07T20:33:11.0897418Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.0899670Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
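Every OOM message in this run suggests PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. A minimal sketch of applying it in-process, with the caveat that the caching allocator reads the variable at initialization, so it must be set before the process makes its first CUDA allocation:

    import os
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    import torch  # the first CUDA use after this point picks up the setting

In CI it is usually simpler to export the variable in the job environment (e.g. PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True pytest moe/activation_test.py), which sidesteps import-order concerns entirely.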
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:11.0901758Z 2025-05-07T20:33:11.0901963Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:11.0902185Z 2025-05-07T20:33:11.0902298Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.0902726Z self=, 2025-05-07T20:33:11.0903190Z T=16384, 2025-05-07T20:33:11.0903386Z D=7168, 2025-05-07T20:33:11.0903576Z scale_ub=None, 2025-05-07T20:33:11.0903792Z contiguous=True, 2025-05-07T20:33:11.0904016Z compiled=False, 2025-05-07T20:33:11.0904221Z ) 2025-05-07T20:33:11.0904546Z self = 2025-05-07T20:33:11.0905056Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:11.0905347Z 2025-05-07T20:33:11.0905425Z @given( 2025-05-07T20:33:11.0905662Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.0905976Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.0906286Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.0906619Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.0906947Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.0907230Z ) 2025-05-07T20:33:11.0907579Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.0908036Z def test_silu_mul_quant( 2025-05-07T20:33:11.0908271Z self, 2025-05-07T20:33:11.0908466Z T: int, 2025-05-07T20:33:11.0908660Z D: int, 2025-05-07T20:33:11.0908874Z scale_ub: Optional[float], 2025-05-07T20:33:11.0909142Z contiguous: bool, 2025-05-07T20:33:11.0909382Z compiled: bool, 2025-05-07T20:33:11.0909602Z ) -> None: 2025-05-07T20:33:11.0909812Z torch.manual_seed(2025) 2025-05-07T20:33:11.0910057Z 2025-05-07T20:33:11.0910329Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.0912507Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:11.0914507Z 2025-05-07T20:33:11.0914625Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:11.0914845Z 2025-05-07T20:33:11.0914945Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.0915364Z self=, 2025-05-07T20:33:11.0915782Z T=16384, 2025-05-07T20:33:11.0915974Z D=7168, 2025-05-07T20:33:11.0916175Z scale_ub=1200.0, 2025-05-07T20:33:11.0916398Z contiguous=True, 2025-05-07T20:33:11.0916627Z compiled=False, 2025-05-07T20:33:11.0916838Z ) 2025-05-07T20:33:11.0917167Z self = 2025-05-07T20:33:11.0917679Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:11.0917973Z 2025-05-07T20:33:11.0918055Z @given( 2025-05-07T20:33:11.0918287Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.0918601Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.0918916Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.0919261Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.0919605Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.0919954Z ) 2025-05-07T20:33:11.0920318Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.0920776Z def test_silu_mul_quant( 2025-05-07T20:33:11.0921017Z self, 2025-05-07T20:33:11.0921230Z T: int, 2025-05-07T20:33:11.0921516Z D: int, 2025-05-07T20:33:11.0921742Z scale_ub: Optional[float], 2025-05-07T20:33:11.0922026Z contiguous: bool, 2025-05-07T20:33:11.0922274Z compiled: bool, 2025-05-07T20:33:11.0922499Z ) -> None: 2025-05-07T20:33:11.0922750Z torch.manual_seed(2025) 2025-05-07T20:33:11.0922991Z 2025-05-07T20:33:11.0923262Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.0925707Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:11.0927713Z 2025-05-07T20:33:11.0927830Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:11.0928049Z 2025-05-07T20:33:11.0928156Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.0928580Z self=, 2025-05-07T20:33:11.0929046Z T=128, 2025-05-07T20:33:11.0929232Z D=5120, 2025-05-07T20:33:11.0929427Z scale_ub=1200.0, 2025-05-07T20:33:11.0929653Z contiguous=False, 2025-05-07T20:33:11.0929877Z compiled=False, 2025-05-07T20:33:11.0930087Z ) 2025-05-07T20:33:11.2213188Z self = 2025-05-07T20:33:11.2213954Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:11.2214460Z 2025-05-07T20:33:11.2214575Z @given( 2025-05-07T20:33:11.2214881Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.2215258Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.2215567Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.2215893Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.2216218Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.2216509Z ) 2025-05-07T20:33:11.2216850Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.2217317Z def test_silu_mul_quant( 2025-05-07T20:33:11.2217558Z self, 2025-05-07T20:33:11.2217745Z T: int, 2025-05-07T20:33:11.2217938Z D: int, 2025-05-07T20:33:11.2218143Z scale_ub: Optional[float], 2025-05-07T20:33:11.2218415Z contiguous: bool, 2025-05-07T20:33:11.2218643Z compiled: bool, 2025-05-07T20:33:11.2218859Z ) -> None: 2025-05-07T20:33:11.2219068Z torch.manual_seed(2025) 2025-05-07T20:33:11.2219306Z 2025-05-07T20:33:11.2219570Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.2219926Z 2025-05-07T20:33:11.2220119Z x_sign = torch.sign(x) 2025-05-07T20:33:11.2220407Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:11.2220719Z x = x_sign * x_clamp 2025-05-07T20:33:11.2220961Z x0 = x[:, :D] 2025-05-07T20:33:11.2221174Z x1 = x[:, D:] 2025-05-07T20:33:11.2221375Z 2025-05-07T20:33:11.2221556Z if contiguous: 2025-05-07T20:33:11.2221784Z x0 = x0.contiguous() 2025-05-07T20:33:11.2222041Z x1 = x1.contiguous() 2025-05-07T20:33:11.2222283Z 2025-05-07T20:33:11.2222465Z if scale_ub is not None: 2025-05-07T20:33:11.2222734Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:11.2223212Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:11.2223522Z ) 2025-05-07T20:33:11.2223703Z else: 2025-05-07T20:33:11.2223912Z scale_ub_tensor = None 2025-05-07T20:33:11.2224165Z 2025-05-07T20:33:11.2224511Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:11.2224838Z op = silu_mul_quant 2025-05-07T20:33:11.2225101Z if compiled: 2025-05-07T20:33:11.2225347Z op = torch.compile(op) 2025-05-07T20:33:11.2225828Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.2226183Z 2025-05-07T20:33:11.2226374Z > y_fp8, y_scale = fn() 2025-05-07T20:33:11.2226547Z 2025-05-07T20:33:11.2226646Z moe/activation_test.py:117: 2025-05-07T20:33:11.2226944Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.2227289Z moe/activation_test.py:115: in fn 2025-05-07T20:33:11.2227577Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.2228308Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:11.2229036Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:11.2229598Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:11.2230315Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:11.2231006Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:11.2231573Z kernel = self.compile( 2025-05-07T20:33:11.2232128Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:11.2232809Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:11.2233219Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.2233458Z 2025-05-07T20:33:11.2233668Z self = 2025-05-07T20:33:11.2234799Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:11.2236226Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f98569cccc0>} 2025-05-07T20:33:11.2237625Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:11.2238701Z context = 2025-05-07T20:33:11.2239000Z 2025-05-07T20:33:11.2239168Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:11.2239711Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:11.2240195Z module_map=module_map) 2025-05-07T20:33:11.2240576Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:11.2240932Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:11.2241203Z E ^ 2025-05-07T20:33:11.2241686Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:11.2242159Z 2025-05-07T20:33:11.2242594Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:11.2243140Z 2025-05-07T20:33:11.2243243Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.2243668Z self=, 2025-05-07T20:33:11.2244166Z T=2048, 2025-05-07T20:33:11.2244346Z D=7168, 2025-05-07T20:33:11.2244538Z scale_ub=None, 2025-05-07T20:33:11.2244758Z contiguous=False, 2025-05-07T20:33:11.2244985Z compiled=False, 2025-05-07T20:33:11.2245180Z ) 2025-05-07T20:33:11.2245619Z self = 2025-05-07T20:33:11.2246130Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:11.2246415Z 2025-05-07T20:33:11.2246490Z @given( 2025-05-07T20:33:11.2246719Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.2247073Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.2247379Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.2247706Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.2248041Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.2248339Z ) 2025-05-07T20:33:11.2248686Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.2249148Z def test_silu_mul_quant( 2025-05-07T20:33:11.2249397Z self, 2025-05-07T20:33:11.2249589Z T: int, 2025-05-07T20:33:11.2249790Z D: int, 2025-05-07T20:33:11.2250012Z scale_ub: Optional[float], 2025-05-07T20:33:11.2250293Z contiguous: bool, 2025-05-07T20:33:11.2250535Z compiled: bool, 2025-05-07T20:33:11.2250756Z ) -> None: 2025-05-07T20:33:11.2250961Z torch.manual_seed(2025) 2025-05-07T20:33:11.2251199Z 2025-05-07T20:33:11.2251469Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.2253659Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
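The CompilationError above is a different failure mode from the OOMs: Triton refuses to lower the fp8e4nv (FP8 E4M3) dtype on this GPU, offering only 'fp8e4b15' and 'fp8e5'. That is the expected behavior on Ampere-class parts such as the A10G (sm_86), since Triton's fp8e4nv needs hardware FP8 (compute capability 8.9 and newer). A hedged sketch of a capability guard a test could use; the 8.9 threshold is an assumption about this particular Triton build:

    import torch

    def fp8e4nv_supported() -> bool:
        # fp8e4nv needs hardware FP8 (Ada/Hopper, sm_89+); Ampere is sm_80/sm_86.
        major, minor = torch.cuda.get_device_capability()
        return (major, minor) >= (8, 9)

    # e.g. @unittest.skipIf(not fp8e4nv_supported(), "FP8 E4M3 needs sm_89+")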
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:11.2255691Z 2025-05-07T20:33:11.2255817Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:11.2256030Z 2025-05-07T20:33:11.2256128Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.2256541Z self=, 2025-05-07T20:33:11.2256953Z T=128, 2025-05-07T20:33:11.2257132Z D=7168, 2025-05-07T20:33:11.2257319Z scale_ub=1200.0, 2025-05-07T20:33:11.2257533Z contiguous=True, 2025-05-07T20:33:11.2257744Z compiled=True, 2025-05-07T20:33:11.2257938Z ) 2025-05-07T20:33:11.2572579Z self = 2025-05-07T20:33:11.2573172Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:11.2573942Z 2025-05-07T20:33:11.2574241Z @given( 2025-05-07T20:33:11.2574688Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.2575130Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.2575542Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.2575982Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.2576309Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.2576597Z ) 2025-05-07T20:33:11.2576939Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.2577395Z def test_silu_mul_quant( 2025-05-07T20:33:11.2577625Z self, 2025-05-07T20:33:11.2577814Z T: int, 2025-05-07T20:33:11.2578005Z D: int, 2025-05-07T20:33:11.2578214Z scale_ub: Optional[float], 2025-05-07T20:33:11.2578483Z contiguous: bool, 2025-05-07T20:33:11.2578719Z compiled: bool, 2025-05-07T20:33:11.2578930Z ) -> None: 2025-05-07T20:33:11.2579262Z torch.manual_seed(2025) 2025-05-07T20:33:11.2579506Z 2025-05-07T20:33:11.2579774Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.2580118Z 2025-05-07T20:33:11.2580305Z x_sign = torch.sign(x) 2025-05-07T20:33:11.2580702Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:11.2581011Z x = x_sign * x_clamp 2025-05-07T20:33:11.2581244Z x0 = x[:, :D] 2025-05-07T20:33:11.2581448Z x1 = x[:, D:] 2025-05-07T20:33:11.2581648Z 2025-05-07T20:33:11.2581824Z if contiguous: 2025-05-07T20:33:11.2582108Z x0 = x0.contiguous() 2025-05-07T20:33:11.2582360Z x1 = x1.contiguous() 2025-05-07T20:33:11.2582602Z 2025-05-07T20:33:11.2582784Z if scale_ub is not None: 2025-05-07T20:33:11.2583054Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:11.2583384Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:11.2583693Z ) 2025-05-07T20:33:11.2583873Z else: 2025-05-07T20:33:11.2584076Z scale_ub_tensor = None 2025-05-07T20:33:11.2584321Z 2025-05-07T20:33:11.2584541Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:11.2584853Z op = silu_mul_quant 2025-05-07T20:33:11.2585105Z if compiled: 2025-05-07T20:33:11.2585340Z op = torch.compile(op) 2025-05-07T20:33:11.2585634Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.2585910Z 2025-05-07T20:33:11.2586093Z > y_fp8, y_scale = fn() 2025-05-07T20:33:11.2586262Z 2025-05-07T20:33:11.2586358Z moe/activation_test.py:117: 2025-05-07T20:33:11.2586652Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.2586979Z moe/activation_test.py:115: in fn 2025-05-07T20:33:11.2587258Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.2587831Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:11.2588413Z return fn(*args, **kwargs) 
2025-05-07T20:33:11.2589085Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:11.2589798Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:11.2590353Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:11.2591057Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:11.2591743Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:11.2592293Z kernel = self.compile( 2025-05-07T20:33:11.2592846Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:11.2593519Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:11.2593924Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.2594158Z 2025-05-07T20:33:11.2594364Z self = 2025-05-07T20:33:11.2595488Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:11.2596906Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f98569cda80>} 2025-05-07T20:33:11.2598317Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:11.2599442Z context = 2025-05-07T20:33:11.2599792Z 2025-05-07T20:33:11.2599964Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:11.2600501Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:11.2601052Z module_map=module_map) 2025-05-07T20:33:11.2601428Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:11.2601789Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:11.2602048Z E ^ 2025-05-07T20:33:11.2602526Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:11.2603065Z 2025-05-07T20:33:11.2603504Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:11.2604045Z 2025-05-07T20:33:11.2604152Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.2604580Z self=, 2025-05-07T20:33:11.2604999Z T=128, 2025-05-07T20:33:11.2605192Z D=7168, 2025-05-07T20:33:11.2605378Z scale_ub=1200.0, 2025-05-07T20:33:11.2605602Z contiguous=True, 2025-05-07T20:33:11.2605820Z compiled=False, 2025-05-07T20:33:11.2606019Z ) 2025-05-07T20:33:11.2606347Z self = 2025-05-07T20:33:11.2606857Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:11.2607142Z 2025-05-07T20:33:11.2607230Z @given( 2025-05-07T20:33:11.2607455Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.2607773Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.2608090Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.2608421Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.2608756Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.2609054Z ) 2025-05-07T20:33:11.2609402Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.2609859Z def test_silu_mul_quant( 2025-05-07T20:33:11.2610106Z self, 2025-05-07T20:33:11.2610290Z T: int, 2025-05-07T20:33:11.2610495Z D: int, 2025-05-07T20:33:11.2610716Z scale_ub: Optional[float], 2025-05-07T20:33:11.2610984Z contiguous: bool, 2025-05-07T20:33:11.2611228Z compiled: bool, 2025-05-07T20:33:11.2611449Z ) -> None: 2025-05-07T20:33:11.2611660Z torch.manual_seed(2025) 2025-05-07T20:33:11.2611890Z 2025-05-07T20:33:11.2612163Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.2612515Z 2025-05-07T20:33:11.2612702Z x_sign = torch.sign(x) 2025-05-07T20:33:11.2612987Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:11.2615273Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:11.2617261Z 2025-05-07T20:33:11.2617381Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:11.2617598Z 2025-05-07T20:33:11.2617706Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.2618118Z self=, 2025-05-07T20:33:11.2618527Z T=128, 2025-05-07T20:33:11.2618713Z D=5120, 2025-05-07T20:33:11.2618910Z scale_ub=1200.0, 2025-05-07T20:33:11.2619133Z contiguous=True, 2025-05-07T20:33:11.2619356Z compiled=True, 2025-05-07T20:33:11.2619609Z ) 2025-05-07T20:33:11.2619932Z self = 2025-05-07T20:33:11.2620443Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:11.2620723Z 2025-05-07T20:33:11.2620804Z @given( 2025-05-07T20:33:11.2621121Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.2621443Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.2621755Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.2622087Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.2622474Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.2622767Z ) 2025-05-07T20:33:11.2623123Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.2623582Z def test_silu_mul_quant( 2025-05-07T20:33:11.2623834Z self, 2025-05-07T20:33:11.2624031Z T: int, 2025-05-07T20:33:11.2624241Z D: int, 2025-05-07T20:33:11.2624469Z scale_ub: Optional[float], 2025-05-07T20:33:11.2624746Z contiguous: bool, 2025-05-07T20:33:11.2624991Z compiled: bool, 2025-05-07T20:33:11.2625211Z ) -> None: 2025-05-07T20:33:11.2625629Z torch.manual_seed(2025) 2025-05-07T20:33:11.2625910Z 2025-05-07T20:33:11.2626199Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.2626549Z 2025-05-07T20:33:11.2626745Z x_sign = torch.sign(x) 2025-05-07T20:33:11.2627039Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:11.2629208Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
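Note how the failure point has drifted from activation_test.py:92 (the initial torch.randn) to :95 (the torch.clamp), while free memory has shrunk from 26.44 MiB to 4.44 MiB: allocations from earlier examples are still alive as Hypothesis keeps generating new ones. A hedged mitigation sketch, not taken from the test file, that releases cached blocks between examples at some cost in speed:

    import gc
    import torch

    def free_cuda() -> None:
        gc.collect()              # drop dead tensors still referenced by Python
        torch.cuda.empty_cache()  # return cached blocks to the CUDA driver
        torch.cuda.synchronize()  # make sure pending frees have completed

Calling something like this from tearDown(), or simply avoiding long-lived references to x/x0/x1, would keep one failing example from starving the next.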
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:11.2631184Z 2025-05-07T20:33:11.2631303Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:11.2631518Z 2025-05-07T20:33:11.2631622Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.2632039Z self=, 2025-05-07T20:33:11.2632456Z T=128, 2025-05-07T20:33:11.2632645Z D=7168, 2025-05-07T20:33:11.2632840Z scale_ub=None, 2025-05-07T20:33:11.2633050Z contiguous=True, 2025-05-07T20:33:11.2633272Z compiled=True, 2025-05-07T20:33:11.2633470Z ) 2025-05-07T20:33:11.7800991Z self = 2025-05-07T20:33:11.7801884Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:11.7802309Z 2025-05-07T20:33:11.7809049Z @given( 2025-05-07T20:33:11.7809425Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.7809751Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.7810062Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.7810392Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.7810729Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.7811014Z ) 2025-05-07T20:33:11.7811377Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.7811829Z def test_silu_mul_quant( 2025-05-07T20:33:11.7812075Z self, 2025-05-07T20:33:11.7812272Z T: int, 2025-05-07T20:33:11.7812477Z D: int, 2025-05-07T20:33:11.7812699Z scale_ub: Optional[float], 2025-05-07T20:33:11.7812970Z contiguous: bool, 2025-05-07T20:33:11.7813208Z compiled: bool, 2025-05-07T20:33:11.7813438Z ) -> None: 2025-05-07T20:33:11.7813655Z torch.manual_seed(2025) 2025-05-07T20:33:11.7814025Z 2025-05-07T20:33:11.7814307Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.7816725Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
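For orientation, the property under test (its reference path appears in full further down this log) is that the fused Triton kernel agrees with an unfused FP32 computation: y = SiLU(x0) * x1 = x0 * sigmoid(x0) * x1, row-quantized so that y_fp8.to(torch.float32) * y_scale[:, None] ≈ y. A minimal pure-PyTorch sketch of such a reference, assuming torch.float8_e4m3fn is available, with 448.0 as the E4M3 max and a hypothetical clamp guarding against all-zero rows (scale_ub handling omitted):

    import torch

    def silu_mul_quant_ref(x0: torch.Tensor, x1: torch.Tensor, fp8_max: float = 448.0):
        y = x0.float() * torch.sigmoid(x0.float()) * x1.float()   # SiLU(x0) * x1
        scale = (y.abs().amax(dim=1) / fp8_max).clamp(min=1e-12)  # per-row scale
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

The actual test instead routes the reference through triton_quantize_fp8_row, which is why the reference path hits the same fp8e4nv CompilationError as the kernel under test.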
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:11.7818780Z 2025-05-07T20:33:11.7818912Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:11.7819130Z 2025-05-07T20:33:11.7848714Z FAILED 2025-05-07T20:33:11.7849105Z 2025-05-07T20:33:11.7849495Z =================================== FAILURES =================================== 2025-05-07T20:33:11.7850045Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:33:11.7850690Z + Exception Group Traceback (most recent call last): 2025-05-07T20:33:11.7851475Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 58, in testPartExecutor 2025-05-07T20:33:11.7852118Z | yield 2025-05-07T20:33:11.7852646Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 634, in run 2025-05-07T20:33:11.7853319Z | self._callTestMethod(testMethod) 2025-05-07T20:33:11.7854104Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 589, in _callTestMethod 2025-05-07T20:33:11.7854943Z | if method() is not None: 2025-05-07T20:33:11.7855217Z | ^^^^^^^^ 2025-05-07T20:33:11.7856134Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:33:11.7857216Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.7857650Z | ^^^^^^^ 2025-05-07T20:33:11.7858467Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:33:11.7859397Z | raise the_error_hypothesis_found 2025-05-07T20:33:11.7860017Z | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:33:11.7860625Z +-+---------------- 1 ---------------- 2025-05-07T20:33:11.7861032Z | Traceback (most recent call last): 2025-05-07T20:33:11.7862060Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:11.7863190Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.7863714Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:11.7866653Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
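Hypothesis wraps the four distinct failures in a PEP 654 ExceptionGroup, which is why the summary opens with "+ Exception Group Traceback". A sketch (Python 3.11+) of how a caller could split capacity problems from compilation problems when triaging; run_suite() is a hypothetical stand-in for whatever raised the group:

    import torch

    try:
        run_suite()  # hypothetical: the call that raised the ExceptionGroup above
    except* torch.OutOfMemoryError as eg:
        print(f"{len(eg.exceptions)} OOM failures")    # capacity / leak problem
    except* Exception as eg:
        print(f"{len(eg.exceptions)} other failures")  # e.g. CompilationError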
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:11.7869653Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:11.7870294Z | self=, 2025-05-07T20:33:11.7870888Z | T=2048, 2025-05-07T20:33:11.7871212Z | D=5120, # or any other generated value 2025-05-07T20:33:11.7871693Z | scale_ub=None, # or any other generated value 2025-05-07T20:33:11.7872427Z | contiguous=True, # or any other generated value 2025-05-07T20:33:11.7872956Z | compiled=False, # or any other generated value 2025-05-07T20:33:11.7873381Z | ) 2025-05-07T20:33:11.7873639Z | 2025-05-07T20:33:11.7874605Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:33:11.7875481Z +---------------- 2 ---------------- 2025-05-07T20:33:11.7875900Z | Traceback (most recent call last): 2025-05-07T20:33:11.7876928Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:11.7878165Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.7878689Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:11.7881651Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:11.7884530Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:11.7885140Z | self=, 2025-05-07T20:33:11.7885568Z | T=128, 2025-05-07T20:33:11.7885780Z | D=7168, 2025-05-07T20:33:11.7886000Z | scale_ub=None, 2025-05-07T20:33:11.7886252Z | contiguous=True, 2025-05-07T20:33:11.7886496Z | compiled=True, 2025-05-07T20:33:11.7886732Z | ) 2025-05-07T20:33:11.7886937Z | 2025-05-07T20:33:11.7887484Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:33:11.7888133Z +---------------- 3 ---------------- 2025-05-07T20:33:11.7888447Z | Traceback (most recent call last): 2025-05-07T20:33:11.7889966Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:11.7890801Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.7891206Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:11.7893346Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
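Each falsifying example above comes with a one-line reproduction recipe. A sketch of using it, mirroring the @given strategies printed in this log; the decorator is added temporarily on the test and removed once the bug is fixed, and the version string and blob must match what the log printed for the installed Hypothesis:

    from typing import Optional
    from hypothesis import given, reproduce_failure, strategies as st

    @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=')  # blob printed for failure 1
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    def test_replay(T: int, D: int, scale_ub: Optional[float], contiguous: bool, compiled: bool) -> None:
        ...  # the real body lives in moe/activation_test.py::test_silu_mul_quant

This replays exactly the recorded example (here T=2048, D=5120) instead of re-running the whole search.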
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:11.7895568Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:11.7896023Z | self=, 2025-05-07T20:33:11.7896459Z | T=128, 2025-05-07T20:33:11.7896675Z | D=5120, 2025-05-07T20:33:11.7896889Z | scale_ub=1200.0, 2025-05-07T20:33:11.7897145Z | contiguous=True, 2025-05-07T20:33:11.7897402Z | compiled=True, 2025-05-07T20:33:11.7897637Z | ) 2025-05-07T20:33:11.7897817Z | 2025-05-07T20:33:11.7898364Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:33:11.7899114Z +---------------- 4 ---------------- 2025-05-07T20:33:11.7899416Z | Traceback (most recent call last): 2025-05-07T20:33:11.7900171Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:33:11.7901021Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:11.7901319Z | ^^^^^^^^ 2025-05-07T20:33:11.7901999Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:33:11.7902785Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:11.7903142Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:11.7903983Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:33:11.7904839Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:11.7905494Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:33:11.7906284Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:11.7906743Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:11.7907423Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:33:11.7908252Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:11.7908746Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:11.7909426Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:33:11.7910171Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:11.7910563Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:11.7911197Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:33:11.7911807Z | fn() 2025-05-07T20:33:11.7912415Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:33:11.7913095Z | self.fn.run( 2025-05-07T20:33:11.7913649Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:33:11.7914279Z | kernel = self.compile( 2025-05-07T20:33:11.7914561Z | ^^^^^^^^^^^^^ 2025-05-07T20:33:11.7915396Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:33:11.7916422Z | 
module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:11.7916979Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:11.7917915Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:33:11.7919075Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:11.7919806Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:11.7920355Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:11.7920843Z | def _kernel_quantize_fp8_row( 2025-05-07T20:33:11.7921224Z | ^ 2025-05-07T20:33:11.7921911Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:11.7922816Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:11.7923388Z | # The test always failed when commented parts were varied together. 2025-05-07T20:33:11.7924126Z | self=, 2025-05-07T20:33:11.7924849Z | T=1, # or any other generated value 2025-05-07T20:33:11.7925303Z | D=5120, # or any other generated value 2025-05-07T20:33:11.7926084Z | scale_ub=None, # or any other generated value 2025-05-07T20:33:11.7926616Z | contiguous=True, # or any other generated value 2025-05-07T20:33:11.7927326Z | compiled=True, # or any other generated value 2025-05-07T20:33:11.7927750Z | ) 2025-05-07T20:33:11.7928016Z | 2025-05-07T20:33:11.7928786Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:33:11.7929718Z +------------------------------------ 2025-05-07T20:33:11.7930223Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:33:11.7930770Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.7931362Z self=, 2025-05-07T20:33:11.7931939Z T=1, 2025-05-07T20:33:11.7932218Z D=5120, 2025-05-07T20:33:11.7932499Z scale_ub=None, 2025-05-07T20:33:11.7932799Z contiguous=True, 2025-05-07T20:33:11.7933124Z compiled=True, 2025-05-07T20:33:11.7933418Z ) 2025-05-07T20:33:11.7933869Z self = 2025-05-07T20:33:11.7934670Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:11.7935059Z 2025-05-07T20:33:11.7935174Z @given( 2025-05-07T20:33:11.7935505Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.7935945Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.7936394Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.7936875Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.7937337Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.7937749Z ) 2025-05-07T20:33:11.7938246Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.7938899Z def test_silu_mul_quant( 2025-05-07T20:33:11.7939307Z self, 2025-05-07T20:33:11.7939595Z T: int, 2025-05-07T20:33:11.7939886Z D: int, 2025-05-07T20:33:11.7940199Z scale_ub: Optional[float], 2025-05-07T20:33:11.7940595Z contiguous: bool, 2025-05-07T20:33:11.7940957Z compiled: bool, 2025-05-07T20:33:11.7941277Z ) -> None: 2025-05-07T20:33:11.7941588Z torch.manual_seed(2025) 2025-05-07T20:33:11.7941937Z 2025-05-07T20:33:11.7942315Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.7942814Z 2025-05-07T20:33:11.7943092Z x_sign = torch.sign(x) 2025-05-07T20:33:11.7943474Z x_clamp = 
torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:11.7943915Z x = x_sign * x_clamp 2025-05-07T20:33:11.7944257Z x0 = x[:, :D] 2025-05-07T20:33:11.7944563Z x1 = x[:, D:] 2025-05-07T20:33:11.7944857Z 2025-05-07T20:33:11.7945130Z if contiguous: 2025-05-07T20:33:11.7945461Z x0 = x0.contiguous() 2025-05-07T20:33:11.7945841Z x1 = x1.contiguous() 2025-05-07T20:33:11.7946191Z 2025-05-07T20:33:11.7946462Z if scale_ub is not None: 2025-05-07T20:33:11.7946865Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:11.7947347Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:11.7947791Z ) 2025-05-07T20:33:11.7948072Z else: 2025-05-07T20:33:11.7948378Z scale_ub_tensor = None 2025-05-07T20:33:11.7948761Z 2025-05-07T20:33:11.7949082Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:11.7949541Z op = silu_mul_quant 2025-05-07T20:33:11.7949995Z if compiled: 2025-05-07T20:33:11.7950339Z op = torch.compile(op) 2025-05-07T20:33:11.7950753Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.7951154Z 2025-05-07T20:33:11.7951427Z y_fp8, y_scale = fn() 2025-05-07T20:33:11.7951979Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:11.7952395Z 2025-05-07T20:33:11.7952714Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:11.7953180Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:11.7953652Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:11.7954088Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:11.7954582Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:11.7955016Z 2025-05-07T20:33:11.7955307Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:11.7955582Z 2025-05-07T20:33:11.7955724Z moe/activation_test.py:126: 2025-05-07T20:33:11.7956152Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.7956640Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:11.7957101Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:11.7958273Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:11.7959370Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:11.7960135Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:11.7961096Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:11.7962097Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:11.7963151Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:11.7964190Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:11.7965116Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:11.7966013Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:11.7966762Z fn() 2025-05-07T20:33:11.7967500Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:11.7968351Z self.fn.run( 2025-05-07T20:33:11.7969011Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:11.7969756Z kernel = self.compile( 2025-05-07T20:33:11.7970504Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:11.7971400Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:11.7971964Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.7972276Z 2025-05-07T20:33:11.7972544Z self = 2025-05-07T20:33:11.7974006Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:11.7975969Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f99497ecc20>} 2025-05-07T20:33:11.7977814Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:11.7979315Z context = 2025-05-07T20:33:11.7979699Z 2025-05-07T20:33:11.7979916Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:11.7980617Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:11.7981331Z module_map=module_map) 2025-05-07T20:33:11.7981816Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:11.7982284Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:11.7982714Z E ^ 2025-05-07T20:33:11.7983344Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:11.7983977Z 2025-05-07T20:33:11.7984546Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:11.7985288Z 2025-05-07T20:33:11.7985431Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.7986019Z self=, 2025-05-07T20:33:11.7986608Z T=2048, 2025-05-07T20:33:11.7986867Z D=5120, 2025-05-07T20:33:11.7987141Z scale_ub=1200.0, 2025-05-07T20:33:11.7987460Z contiguous=True, 2025-05-07T20:33:11.7987779Z compiled=False, 2025-05-07T20:33:11.7988076Z ) 2025-05-07T20:33:11.7988516Z self = 2025-05-07T20:33:11.7989251Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:11.7989648Z 2025-05-07T20:33:11.7989750Z @given( 2025-05-07T20:33:11.7990062Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.7990490Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.7990893Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.7991345Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.7991798Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.7992197Z ) 2025-05-07T20:33:11.7992677Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.7993296Z def test_silu_mul_quant( 2025-05-07T20:33:11.7993625Z self, 2025-05-07T20:33:11.7993888Z T: int, 2025-05-07T20:33:11.7994170Z D: int, 2025-05-07T20:33:11.7994459Z scale_ub: Optional[float], 2025-05-07T20:33:11.7994837Z contiguous: bool, 2025-05-07T20:33:11.7995169Z compiled: bool, 2025-05-07T20:33:11.7995470Z ) -> None: 2025-05-07T20:33:11.7995763Z torch.manual_seed(2025) 2025-05-07T20:33:11.7996086Z 2025-05-07T20:33:11.7996441Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.7996908Z 2025-05-07T20:33:11.7997164Z x_sign = torch.sign(x) 2025-05-07T20:33:11.7997552Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:11.7997961Z x = x_sign * x_clamp 2025-05-07T20:33:11.7998294Z x0 = x[:, :D] 
2025-05-07T20:33:11.7998587Z x1 = x[:, D:] 2025-05-07T20:33:11.7998858Z 2025-05-07T20:33:11.7999106Z if contiguous: 2025-05-07T20:33:11.7999418Z x0 = x0.contiguous() 2025-05-07T20:33:11.7999761Z x1 = x1.contiguous() 2025-05-07T20:33:11.8000084Z 2025-05-07T20:33:11.8000358Z if scale_ub is not None: 2025-05-07T20:33:11.8000733Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:11.8001188Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:11.8001615Z ) 2025-05-07T20:33:11.8001886Z else: 2025-05-07T20:33:11.8002189Z scale_ub_tensor = None 2025-05-07T20:33:11.8002560Z 2025-05-07T20:33:11.8002881Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:11.8004864Z op = silu_mul_quant 2025-05-07T20:33:11.8005240Z if compiled: 2025-05-07T20:33:11.8005595Z op = torch.compile(op) 2025-05-07T20:33:11.8006066Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8006475Z 2025-05-07T20:33:11.8006739Z > y_fp8, y_scale = fn() 2025-05-07T20:33:11.8006965Z 2025-05-07T20:33:11.8007115Z moe/activation_test.py:117: 2025-05-07T20:33:11.8007620Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8008088Z moe/activation_test.py:115: in fn 2025-05-07T20:33:11.8008474Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8009444Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:11.8010448Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:11.8011186Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:11.8012112Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:11.8013025Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:11.8013789Z kernel = self.compile( 2025-05-07T20:33:11.8014701Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:11.8015675Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:11.8016233Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8016559Z 2025-05-07T20:33:11.8016853Z self = 2025-05-07T20:33:11.8018383Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:11.8020280Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f99498a8180>} 2025-05-07T20:33:11.8022233Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:11.8023755Z context = 2025-05-07T20:33:11.8024172Z 2025-05-07T20:33:11.8024418Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:11.8025173Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:11.8026096Z module_map=module_map) 2025-05-07T20:33:11.8026627Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:11.8027125Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:11.8027512Z E ^ 2025-05-07T20:33:11.8028200Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:11.8028827Z 2025-05-07T20:33:11.8029451Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:11.8030160Z 2025-05-07T20:33:11.8030305Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.8030883Z self=, 2025-05-07T20:33:11.8031431Z T=2048, 2025-05-07T20:33:11.8031679Z D=5120, 2025-05-07T20:33:11.8052866Z scale_ub=1200.0, 2025-05-07T20:33:11.8053194Z contiguous=True, 2025-05-07T20:33:11.8053478Z compiled=True, 2025-05-07T20:33:11.8053731Z ) 2025-05-07T20:33:11.8054166Z self = 2025-05-07T20:33:11.8055029Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:11.8055414Z 2025-05-07T20:33:11.8055531Z @given( 2025-05-07T20:33:11.8056089Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.8056545Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.8056975Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.8057437Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.8058133Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.8058540Z ) 2025-05-07T20:33:11.8059029Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.8059679Z def test_silu_mul_quant( 2025-05-07T20:33:11.8060114Z self, 2025-05-07T20:33:11.8060399Z T: int, 2025-05-07T20:33:11.8060686Z D: int, 2025-05-07T20:33:11.8060987Z scale_ub: Optional[float], 2025-05-07T20:33:11.8061393Z contiguous: bool, 2025-05-07T20:33:11.8061753Z compiled: bool, 2025-05-07T20:33:11.8062066Z ) -> None: 2025-05-07T20:33:11.8062362Z torch.manual_seed(2025) 2025-05-07T20:33:11.8062703Z 2025-05-07T20:33:11.8063066Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.8063550Z 2025-05-07T20:33:11.8063820Z x_sign = torch.sign(x) 2025-05-07T20:33:11.8064211Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:11.8064630Z x = x_sign * x_clamp 2025-05-07T20:33:11.8064972Z x0 = x[:, :D] 2025-05-07T20:33:11.8065266Z x1 = x[:, D:] 2025-05-07T20:33:11.8065540Z 2025-05-07T20:33:11.8065798Z if contiguous: 2025-05-07T20:33:11.8066103Z x0 = x0.contiguous() 2025-05-07T20:33:11.8066437Z x1 = x1.contiguous() 2025-05-07T20:33:11.8066757Z 2025-05-07T20:33:11.8067003Z if scale_ub is not None: 2025-05-07T20:33:11.8067353Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:11.8067800Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:11.8068217Z ) 2025-05-07T20:33:11.8068458Z else: 2025-05-07T20:33:11.8068731Z scale_ub_tensor = None 2025-05-07T20:33:11.8069089Z 2025-05-07T20:33:11.8069418Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:11.8069854Z op = silu_mul_quant 2025-05-07T20:33:11.8070199Z if compiled: 2025-05-07T20:33:11.8070543Z op = torch.compile(op) 2025-05-07T20:33:11.8070953Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8071342Z 2025-05-07T20:33:11.8071608Z y_fp8, y_scale = fn() 2025-05-07T20:33:11.8071990Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:11.8072381Z 2025-05-07T20:33:11.8072679Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:11.8073015Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:11.8073314Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:11.8073640Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:11.8074001Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:11.8074314Z 2025-05-07T20:33:11.8074511Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:11.8074708Z 2025-05-07T20:33:11.8074815Z moe/activation_test.py:126: 2025-05-07T20:33:11.8075103Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8075450Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:11.8075782Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:11.8076598Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:11.8077392Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:11.8077957Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:11.8078677Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:11.8079383Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:11.8080214Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:11.8081058Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:11.8081739Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:11.8082361Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:11.8082950Z fn() 2025-05-07T20:33:11.8083483Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:11.8084088Z self.fn.run( 2025-05-07T20:33:11.8084562Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:11.8085121Z kernel = self.compile( 2025-05-07T20:33:11.8085676Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:11.8086363Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:11.8086773Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8087012Z 2025-05-07T20:33:11.8087225Z self = 2025-05-07T20:33:11.8088343Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:11.8089839Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9948439580>} 2025-05-07T20:33:11.8091246Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:11.8092332Z context = 2025-05-07T20:33:11.8092628Z 2025-05-07T20:33:11.8092804Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:11.8093345Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:11.8093826Z module_map=module_map) 2025-05-07T20:33:11.8094196Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:11.8094691Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:11.8094966Z E ^ 2025-05-07T20:33:11.8095442Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:11.8095913Z 2025-05-07T20:33:11.8096363Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:11.8096908Z 2025-05-07T20:33:11.8097012Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.8097439Z self=, 2025-05-07T20:33:11.8097861Z T=16384, 2025-05-07T20:33:11.8098052Z D=7168, 2025-05-07T20:33:11.8098257Z scale_ub=1200.0, 2025-05-07T20:33:11.8098486Z contiguous=False, 2025-05-07T20:33:11.8098712Z compiled=False, 2025-05-07T20:33:11.8098929Z ) 2025-05-07T20:33:11.8099292Z self = 2025-05-07T20:33:11.8099814Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:11.8100104Z 2025-05-07T20:33:11.8100183Z @given( 2025-05-07T20:33:11.8100415Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.8100731Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.8101086Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.8101418Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.8101747Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.8102038Z ) 2025-05-07T20:33:11.8102466Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.8102925Z def test_silu_mul_quant( 2025-05-07T20:33:11.8103174Z self, 2025-05-07T20:33:11.8103367Z T: int, 2025-05-07T20:33:11.8103562Z D: int, 2025-05-07T20:33:11.8103811Z scale_ub: Optional[float], 2025-05-07T20:33:11.8104074Z contiguous: bool, 2025-05-07T20:33:11.8104312Z compiled: bool, 2025-05-07T20:33:11.8104531Z ) -> None: 2025-05-07T20:33:11.8104740Z torch.manual_seed(2025) 2025-05-07T20:33:11.8104978Z 2025-05-07T20:33:11.8105257Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.8105611Z 2025-05-07T20:33:11.8105819Z x_sign = torch.sign(x) 2025-05-07T20:33:11.8106117Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:11.8106434Z x = x_sign * x_clamp 2025-05-07T20:33:11.8106686Z x0 = x[:, :D] 2025-05-07T20:33:11.8106906Z x1 = x[:, D:] 2025-05-07T20:33:11.8107108Z 2025-05-07T20:33:11.8107301Z if contiguous: 2025-05-07T20:33:11.8107541Z x0 = x0.contiguous() 2025-05-07T20:33:11.8107802Z x1 = x1.contiguous() 2025-05-07T20:33:11.8108040Z 2025-05-07T20:33:11.8108239Z if scale_ub is not None: 2025-05-07T20:33:11.8108529Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:11.8108861Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:11.8109180Z ) 2025-05-07T20:33:11.8109368Z else: 2025-05-07T20:33:11.8109569Z scale_ub_tensor = None 2025-05-07T20:33:11.8109828Z 2025-05-07T20:33:11.8110066Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:11.8110389Z op = silu_mul_quant 2025-05-07T20:33:11.8110649Z if compiled: 2025-05-07T20:33:11.8110904Z op = torch.compile(op) 2025-05-07T20:33:11.8111202Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8111491Z 2025-05-07T20:33:11.8111694Z > y_fp8, y_scale = fn() 2025-05-07T20:33:11.8111862Z 2025-05-07T20:33:11.8111973Z moe/activation_test.py:117: 2025-05-07T20:33:11.8112276Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8112622Z moe/activation_test.py:115: in fn 2025-05-07T20:33:11.8112915Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8113635Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:11.8114366Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:11.8114931Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:11.8115655Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:11.8116347Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:11.8116914Z kernel = self.compile( 2025-05-07T20:33:11.8117490Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:11.8118176Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:11.8118595Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8118840Z 2025-05-07T20:33:11.8119050Z self = 2025-05-07T20:33:11.8120181Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:11.8121658Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9948439c60>} 2025-05-07T20:33:11.8123149Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:11.8124244Z context = 2025-05-07T20:33:11.8124580Z 2025-05-07T20:33:11.8124761Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:11.8125296Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:11.8126134Z module_map=module_map) 2025-05-07T20:33:11.8126516Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:11.8126891Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:11.8127153Z E ^ 2025-05-07T20:33:11.8127635Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:11.8128105Z 2025-05-07T20:33:11.8128556Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:11.8129101Z 2025-05-07T20:33:11.8129214Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.8129632Z self=, 2025-05-07T20:33:11.8130066Z T=1, 2025-05-07T20:33:11.8130260Z D=7168, 2025-05-07T20:33:11.8130454Z scale_ub=None, 2025-05-07T20:33:11.8130679Z contiguous=True, 2025-05-07T20:33:11.8130907Z compiled=True, 2025-05-07T20:33:11.8131102Z ) 2025-05-07T20:33:11.8131431Z self = 2025-05-07T20:33:11.8131939Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:11.8132209Z 2025-05-07T20:33:11.8132286Z @given( 2025-05-07T20:33:11.8132515Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.8132842Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.8133160Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.8133489Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.8133828Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.8134130Z ) 2025-05-07T20:33:11.8134552Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.8135013Z def test_silu_mul_quant( 2025-05-07T20:33:11.8135251Z self, 2025-05-07T20:33:11.8135448Z T: int, 2025-05-07T20:33:11.8135648Z D: int, 2025-05-07T20:33:11.8135860Z scale_ub: Optional[float], 2025-05-07T20:33:11.8136137Z contiguous: bool, 2025-05-07T20:33:11.8136379Z compiled: bool, 2025-05-07T20:33:11.8136593Z ) -> None: 2025-05-07T20:33:11.8136812Z torch.manual_seed(2025) 2025-05-07T20:33:11.8137053Z 2025-05-07T20:33:11.8137324Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.8137679Z 2025-05-07T20:33:11.8137877Z x_sign = torch.sign(x) 2025-05-07T20:33:11.8138163Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:11.8138489Z x = x_sign * x_clamp 2025-05-07T20:33:11.8138739Z x0 = x[:, :D] 2025-05-07T20:33:11.8138972Z x1 = x[:, D:] 2025-05-07T20:33:11.8139222Z 2025-05-07T20:33:11.8139413Z if contiguous: 2025-05-07T20:33:11.8139651Z x0 = x0.contiguous() 2025-05-07T20:33:11.8139913Z x1 = x1.contiguous() 2025-05-07T20:33:11.8140164Z 2025-05-07T20:33:11.8140361Z if scale_ub is not None: 2025-05-07T20:33:11.8140636Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:11.8141101Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:11.8141419Z ) 2025-05-07T20:33:11.8141612Z else: 2025-05-07T20:33:11.8141821Z scale_ub_tensor = None 2025-05-07T20:33:11.8142075Z 2025-05-07T20:33:11.8142452Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:11.8142778Z op = silu_mul_quant 2025-05-07T20:33:11.8143030Z if compiled: 2025-05-07T20:33:11.8143273Z op = torch.compile(op) 2025-05-07T20:33:11.8143573Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8143917Z 2025-05-07T20:33:11.8144103Z y_fp8, y_scale = fn() 2025-05-07T20:33:11.8144395Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:11.8144694Z 2025-05-07T20:33:11.8144939Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:11.8145282Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:11.8145584Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:11.8145912Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:11.8146273Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:11.8146595Z 2025-05-07T20:33:11.8146795Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:11.8146999Z 2025-05-07T20:33:11.8147103Z moe/activation_test.py:126: 2025-05-07T20:33:11.8147410Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8147751Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:11.8148087Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:11.8148918Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:11.8149748Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:11.8150322Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:11.8151037Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:11.8151764Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:11.8152531Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:11.8153304Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:11.8153969Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:11.8154606Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:11.8155151Z fn() 2025-05-07T20:33:11.8155688Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:11.8156295Z self.fn.run( 2025-05-07T20:33:11.8156785Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:11.8157344Z kernel = self.compile( 2025-05-07T20:33:11.8157900Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:11.8158599Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:11.8159040Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8159300Z 2025-05-07T20:33:11.8159515Z self = 2025-05-07T20:33:11.8160635Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:11.8162064Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f994843ad40>} 2025-05-07T20:33:11.8163630Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:11.8164719Z context = 2025-05-07T20:33:11.8165017Z 2025-05-07T20:33:11.8165194Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:11.8165768Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:11.8166254Z module_map=module_map) 2025-05-07T20:33:11.8166625Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:11.8166984Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:11.8167263Z E ^ 2025-05-07T20:33:11.8167740Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:11.8168215Z 2025-05-07T20:33:11.8168657Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:11.8169201Z 2025-05-07T20:33:11.8169308Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.8169736Z self=, 2025-05-07T20:33:11.8170154Z T=4096, 2025-05-07T20:33:11.8170334Z D=5120, 2025-05-07T20:33:11.8170532Z scale_ub=None, 2025-05-07T20:33:11.8170749Z contiguous=False, 2025-05-07T20:33:11.8170970Z compiled=False, 2025-05-07T20:33:11.8171175Z ) 2025-05-07T20:33:11.8171498Z self = 2025-05-07T20:33:11.8172002Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:11.8172296Z 2025-05-07T20:33:11.8172377Z @given( 2025-05-07T20:33:11.8172606Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.8172922Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.8173233Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.8173569Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.8173908Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.8174193Z ) 2025-05-07T20:33:11.8174619Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.8175076Z def test_silu_mul_quant( 2025-05-07T20:33:11.8175321Z self, 2025-05-07T20:33:11.8175521Z T: int, 2025-05-07T20:33:11.8175721Z D: int, 2025-05-07T20:33:11.8175933Z scale_ub: Optional[float], 2025-05-07T20:33:11.8176211Z contiguous: bool, 2025-05-07T20:33:11.8176451Z compiled: bool, 2025-05-07T20:33:11.8176676Z ) -> None: 2025-05-07T20:33:11.8176885Z torch.manual_seed(2025) 2025-05-07T20:33:11.8177132Z 2025-05-07T20:33:11.8177411Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.8177762Z 2025-05-07T20:33:11.8177958Z x_sign = torch.sign(x) 2025-05-07T20:33:11.8178248Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:11.8178561Z x = x_sign * x_clamp 2025-05-07T20:33:11.8178802Z x0 = x[:, :D] 2025-05-07T20:33:11.8179015Z x1 = x[:, D:] 2025-05-07T20:33:11.8179215Z 2025-05-07T20:33:11.8179398Z if contiguous: 2025-05-07T20:33:11.8179629Z x0 = x0.contiguous() 2025-05-07T20:33:11.8179885Z x1 = x1.contiguous() 2025-05-07T20:33:11.8180135Z 2025-05-07T20:33:11.8180330Z if scale_ub is not None: 2025-05-07T20:33:11.8180601Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:11.8180940Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:11.8181254Z ) 2025-05-07T20:33:11.8181511Z else: 2025-05-07T20:33:11.8181718Z scale_ub_tensor = None 2025-05-07T20:33:11.8181975Z 2025-05-07T20:33:11.8182208Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:11.8182524Z op = silu_mul_quant 2025-05-07T20:33:11.8182771Z if compiled: 2025-05-07T20:33:11.8183095Z op = torch.compile(op) 2025-05-07T20:33:11.8183394Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8183671Z 2025-05-07T20:33:11.8183864Z > y_fp8, y_scale = fn() 2025-05-07T20:33:11.8184027Z 2025-05-07T20:33:11.8184165Z moe/activation_test.py:117: 2025-05-07T20:33:11.8184462Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8184803Z moe/activation_test.py:115: in fn 2025-05-07T20:33:11.8185083Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8185799Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:11.8186528Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:11.8187090Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:11.8187798Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:11.8188498Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:11.8189066Z kernel = self.compile( 2025-05-07T20:33:11.8189672Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:11.8190354Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:11.8190760Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8190993Z 2025-05-07T20:33:11.8191209Z self = 2025-05-07T20:33:11.8192333Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:11.8193753Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f994386c4a0>} 2025-05-07T20:33:11.8195160Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:11.8196249Z context = 2025-05-07T20:33:11.8196549Z 2025-05-07T20:33:11.8196726Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:11.8197259Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:11.8197745Z module_map=module_map) 2025-05-07T20:33:11.8198116Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:11.8198479Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:11.8198738Z E ^ 2025-05-07T20:33:11.8199222Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:11.8199691Z 2025-05-07T20:33:11.8200131Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:11.8200674Z 2025-05-07T20:33:11.8200778Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.8201204Z self=, 2025-05-07T20:33:11.8201616Z T=4096, 2025-05-07T20:33:11.8201806Z D=7168, 2025-05-07T20:33:11.8201991Z scale_ub=None, 2025-05-07T20:33:11.8202205Z contiguous=False, 2025-05-07T20:33:11.8202482Z compiled=False, 2025-05-07T20:33:11.8202679Z ) 2025-05-07T20:33:11.8202999Z self = 2025-05-07T20:33:11.8203509Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:11.8203791Z 2025-05-07T20:33:11.8203943Z @given( 2025-05-07T20:33:11.8204174Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.8204490Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.8204792Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.8205165Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.8205499Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.8205785Z ) 2025-05-07T20:33:11.8206128Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.8206580Z def test_silu_mul_quant( 2025-05-07T20:33:11.8206825Z self, 2025-05-07T20:33:11.8207017Z T: int, 2025-05-07T20:33:11.8207213Z D: int, 2025-05-07T20:33:11.8207432Z scale_ub: Optional[float], 2025-05-07T20:33:11.8207703Z contiguous: bool, 2025-05-07T20:33:11.8207943Z compiled: bool, 2025-05-07T20:33:11.8208169Z ) -> None: 2025-05-07T20:33:11.8208384Z torch.manual_seed(2025) 2025-05-07T20:33:11.8217044Z 2025-05-07T20:33:11.8217364Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.8217730Z 2025-05-07T20:33:11.8217935Z x_sign = torch.sign(x) 2025-05-07T20:33:11.8218244Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:11.8218577Z x = x_sign * x_clamp 2025-05-07T20:33:11.8218830Z x0 = x[:, :D] 2025-05-07T20:33:11.8219050Z x1 = x[:, D:] 2025-05-07T20:33:11.8219302Z 2025-05-07T20:33:11.8219516Z if contiguous: 2025-05-07T20:33:11.8219746Z x0 = x0.contiguous() 2025-05-07T20:33:11.8220012Z x1 = x1.contiguous() 2025-05-07T20:33:11.8220266Z 2025-05-07T20:33:11.8220457Z if scale_ub is not None: 2025-05-07T20:33:11.8220738Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:11.8221080Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:11.8221395Z ) 2025-05-07T20:33:11.8221601Z else: 2025-05-07T20:33:11.8221820Z scale_ub_tensor = None 2025-05-07T20:33:11.8222083Z 2025-05-07T20:33:11.8222314Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:11.8222643Z op = silu_mul_quant 2025-05-07T20:33:11.8222901Z if compiled: 2025-05-07T20:33:11.8223147Z op = torch.compile(op) 2025-05-07T20:33:11.8223455Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8223741Z 2025-05-07T20:33:11.8223936Z > y_fp8, y_scale = fn() 2025-05-07T20:33:11.8224109Z 2025-05-07T20:33:11.8224211Z moe/activation_test.py:117: 2025-05-07T20:33:11.8224516Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8224856Z moe/activation_test.py:115: in fn 2025-05-07T20:33:11.8225145Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8226237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:11.8226984Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:11.8227545Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:11.8228277Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:11.8228987Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:11.8229550Z kernel = self.compile( 2025-05-07T20:33:11.8230126Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:11.8231005Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:11.8231432Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8231676Z 2025-05-07T20:33:11.8231890Z self = 2025-05-07T20:33:11.8233160Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:11.8234663Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f994386df80>} 2025-05-07T20:33:11.8236085Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:11.8237183Z context = 2025-05-07T20:33:11.8237485Z 2025-05-07T20:33:11.8237659Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:11.8238214Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:11.8238710Z module_map=module_map) 2025-05-07T20:33:11.8239125Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:11.8239503Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:11.8239781Z E ^ 2025-05-07T20:33:11.8240269Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:11.8240746Z 2025-05-07T20:33:11.8241188Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:11.8241744Z 2025-05-07T20:33:11.8241855Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.8242287Z self=, 2025-05-07T20:33:11.8242709Z T=128, 2025-05-07T20:33:11.8242897Z D=7168, 2025-05-07T20:33:11.8243102Z scale_ub=None, 2025-05-07T20:33:11.8243338Z contiguous=False, 2025-05-07T20:33:11.8243564Z compiled=True, 2025-05-07T20:33:11.8243773Z ) 2025-05-07T20:33:11.8244104Z self = 2025-05-07T20:33:11.8244611Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:11.8244903Z 2025-05-07T20:33:11.8244986Z @given( 2025-05-07T20:33:11.8245223Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.8245549Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.8245873Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.8246224Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.8246584Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.8246877Z ) 2025-05-07T20:33:11.8247243Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.8247711Z def test_silu_mul_quant( 2025-05-07T20:33:11.8247962Z self, 2025-05-07T20:33:11.8248172Z T: int, 2025-05-07T20:33:11.8248376Z D: int, 2025-05-07T20:33:11.8248595Z scale_ub: Optional[float], 2025-05-07T20:33:11.8248878Z contiguous: bool, 2025-05-07T20:33:11.8249126Z compiled: bool, 2025-05-07T20:33:11.8249355Z ) -> None: 2025-05-07T20:33:11.8249578Z torch.manual_seed(2025) 2025-05-07T20:33:11.8249830Z 2025-05-07T20:33:11.8250114Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.8250473Z 2025-05-07T20:33:11.8250677Z x_sign = torch.sign(x) 2025-05-07T20:33:11.8250970Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:11.8251355Z x = x_sign * x_clamp 2025-05-07T20:33:11.8251613Z x0 = x[:, :D] 2025-05-07T20:33:11.8251844Z x1 = x[:, D:] 2025-05-07T20:33:11.8252061Z 2025-05-07T20:33:11.8252265Z if contiguous: 2025-05-07T20:33:11.8252513Z x0 = x0.contiguous() 2025-05-07T20:33:11.8252859Z x1 = x1.contiguous() 2025-05-07T20:33:11.8253124Z 2025-05-07T20:33:11.8253333Z if scale_ub is not None: 2025-05-07T20:33:11.8253613Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:11.8253967Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:11.8254419Z ) 2025-05-07T20:33:11.8254622Z else: 2025-05-07T20:33:11.8254847Z scale_ub_tensor = None 2025-05-07T20:33:11.8255110Z 2025-05-07T20:33:11.8255345Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:11.8255675Z op = silu_mul_quant 2025-05-07T20:33:11.8255941Z if compiled: 2025-05-07T20:33:11.8256194Z op = torch.compile(op) 2025-05-07T20:33:11.8256505Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8256791Z 2025-05-07T20:33:11.8256985Z y_fp8, y_scale = fn() 2025-05-07T20:33:11.8257278Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:11.8257581Z 2025-05-07T20:33:11.8257835Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:11.8258175Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:11.8258481Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:11.8258813Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:11.8259233Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:11.8259560Z 2025-05-07T20:33:11.8259772Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:11.8259973Z 2025-05-07T20:33:11.8260076Z moe/activation_test.py:126: 2025-05-07T20:33:11.8260380Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8260733Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:11.8261070Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:11.8261895Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:11.8262697Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:11.8263273Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:11.8263990Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:11.8264724Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:11.8265492Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:11.8266273Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:11.8266950Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:11.8267588Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:11.8268140Z fn() 2025-05-07T20:33:11.8268685Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:11.8269302Z self.fn.run( 2025-05-07T20:33:11.8269797Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:11.8270368Z kernel = self.compile( 2025-05-07T20:33:11.8270933Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:11.8271632Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:11.8272053Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8272350Z 2025-05-07T20:33:11.8272573Z self = 2025-05-07T20:33:11.8273779Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:11.8275225Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f994386ec00>} 2025-05-07T20:33:11.8276690Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:11.8277785Z context = 2025-05-07T20:33:11.8277794Z 2025-05-07T20:33:11.8277965Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:11.8278247Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:11.8278359Z module_map=module_map) 2025-05-07T20:33:11.8278531Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:11.8278644Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:11.8278724Z E ^ 2025-05-07T20:33:11.8279097Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:11.8279105Z 2025-05-07T20:33:11.8279552Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:11.8279557Z 2025-05-07T20:33:11.8279664Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.8279905Z self=, 2025-05-07T20:33:11.8279995Z T=128, 2025-05-07T20:33:11.8280076Z D=7168, 2025-05-07T20:33:11.8280169Z scale_ub=None, 2025-05-07T20:33:11.8280261Z contiguous=False, 2025-05-07T20:33:11.8280350Z compiled=False, 2025-05-07T20:33:11.8280433Z ) 2025-05-07T20:33:11.8280665Z self = 2025-05-07T20:33:11.8280854Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:11.8280859Z 2025-05-07T20:33:11.8280940Z @given( 2025-05-07T20:33:11.8281062Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.8281171Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.8281288Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.8281406Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.8281527Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.8281604Z ) 2025-05-07T20:33:11.8281860Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.8281967Z def test_silu_mul_quant( 2025-05-07T20:33:11.8282046Z self, 2025-05-07T20:33:11.8282129Z T: int, 2025-05-07T20:33:11.8282212Z D: int, 2025-05-07T20:33:11.8282312Z scale_ub: Optional[float], 2025-05-07T20:33:11.8282415Z contiguous: bool, 2025-05-07T20:33:11.8282503Z compiled: bool, 2025-05-07T20:33:11.8282582Z ) -> None: 2025-05-07T20:33:11.8282687Z torch.manual_seed(2025) 2025-05-07T20:33:11.8282763Z 2025-05-07T20:33:11.8282943Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.8283028Z 2025-05-07T20:33:11.8283122Z x_sign = torch.sign(x) 2025-05-07T20:33:11.8283250Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:11.8283349Z x = x_sign * x_clamp 2025-05-07T20:33:11.8283435Z x0 = x[:, :D] 2025-05-07T20:33:11.8283521Z x1 = x[:, D:] 2025-05-07T20:33:11.8283654Z 2025-05-07T20:33:11.8283747Z if contiguous: 2025-05-07T20:33:11.8283854Z x0 = x0.contiguous() 2025-05-07T20:33:11.8283947Z x1 = x1.contiguous() 2025-05-07T20:33:11.8284027Z 2025-05-07T20:33:11.8284126Z if scale_ub is not None: 2025-05-07T20:33:11.8284345Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:11.8284487Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:11.8284576Z ) 2025-05-07T20:33:11.8284658Z else: 2025-05-07T20:33:11.8284756Z scale_ub_tensor = None 2025-05-07T20:33:11.8284905Z 2025-05-07T20:33:11.8285037Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:11.8285129Z op = silu_mul_quant 2025-05-07T20:33:11.8285221Z if compiled: 2025-05-07T20:33:11.8285323Z op = torch.compile(op) 2025-05-07T20:33:11.8285436Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8285510Z 2025-05-07T20:33:11.8285608Z > y_fp8, y_scale = fn() 2025-05-07T20:33:11.8285612Z 2025-05-07T20:33:11.8285718Z moe/activation_test.py:117: 2025-05-07T20:33:11.8285853Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8285957Z moe/activation_test.py:115: in fn 2025-05-07T20:33:11.8286072Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8286601Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:11.8286707Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:11.8287093Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:11.8287324Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:11.8287692Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:11.8287795Z kernel = self.compile( 2025-05-07T20:33:11.8288200Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:11.8288388Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:11.8288526Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8288531Z 2025-05-07T20:33:11.8288748Z self = 2025-05-07T20:33:11.8289565Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:11.8290086Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9943d379c0>} 2025-05-07T20:33:11.8290889Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:11.8291086Z context = 2025-05-07T20:33:11.8291095Z 2025-05-07T20:33:11.8291274Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:11.8291552Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:11.8291664Z module_map=module_map) 2025-05-07T20:33:11.8291843Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:11.8291948Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:11.8292027Z E ^ 2025-05-07T20:33:11.8292404Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:11.8292458Z 2025-05-07T20:33:11.8292897Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:11.8292902Z 2025-05-07T20:33:11.8293013Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.8293317Z self=, 2025-05-07T20:33:11.8293404Z T=4096, 2025-05-07T20:33:11.8293504Z D=5120, 2025-05-07T20:33:11.8293595Z scale_ub=1200.0, 2025-05-07T20:33:11.8293695Z contiguous=True, 2025-05-07T20:33:11.8293785Z compiled=False, 2025-05-07T20:33:11.8293904Z ) 2025-05-07T20:33:11.8294141Z self = 2025-05-07T20:33:11.8294322Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:11.8294327Z 2025-05-07T20:33:11.8294495Z @given( 2025-05-07T20:33:11.8294619Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.8294724Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.8294858Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.8294982Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.8295099Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.8295188Z ) 2025-05-07T20:33:11.8295451Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.8295557Z def test_silu_mul_quant( 2025-05-07T20:33:11.8295639Z self, 2025-05-07T20:33:11.8295722Z T: int, 2025-05-07T20:33:11.8295812Z D: int, 2025-05-07T20:33:11.8295918Z scale_ub: Optional[float], 2025-05-07T20:33:11.8296014Z contiguous: bool, 2025-05-07T20:33:11.8296111Z compiled: bool, 2025-05-07T20:33:11.8296192Z ) -> None: 2025-05-07T20:33:11.8296291Z torch.manual_seed(2025) 2025-05-07T20:33:11.8296374Z 2025-05-07T20:33:11.8296547Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.8296630Z 2025-05-07T20:33:11.8296733Z x_sign = torch.sign(x) 2025-05-07T20:33:11.8296863Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:11.8296958Z x = x_sign * x_clamp 2025-05-07T20:33:11.8297056Z x0 = x[:, :D] 2025-05-07T20:33:11.8297140Z x1 = x[:, D:] 2025-05-07T20:33:11.8297228Z 2025-05-07T20:33:11.8297317Z if contiguous: 2025-05-07T20:33:11.8297416Z x0 = x0.contiguous() 2025-05-07T20:33:11.8297520Z x1 = x1.contiguous() 2025-05-07T20:33:11.8297599Z 2025-05-07T20:33:11.8297697Z if scale_ub is not None: 2025-05-07T20:33:11.8297814Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:11.8297954Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:11.8298037Z ) 2025-05-07T20:33:11.8298130Z else: 2025-05-07T20:33:11.8298231Z scale_ub_tensor = None 2025-05-07T20:33:11.8298310Z 2025-05-07T20:33:11.8298449Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:11.8298548Z op = silu_mul_quant 2025-05-07T20:33:11.8298649Z if compiled: 2025-05-07T20:33:11.8298754Z op = torch.compile(op) 2025-05-07T20:33:11.8298866Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8298950Z 2025-05-07T20:33:11.8299050Z > y_fp8, y_scale = fn() 2025-05-07T20:33:11.8299055Z 2025-05-07T20:33:11.8299156Z moe/activation_test.py:117: 2025-05-07T20:33:11.8299300Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8299411Z moe/activation_test.py:115: in fn 2025-05-07T20:33:11.8299516Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8300050Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:11.8300149Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:11.8300534Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:11.8300816Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:11.8301175Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:11.8301349Z kernel = self.compile( 2025-05-07T20:33:11.8301758Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:11.8301936Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:11.8302115Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8302120Z 2025-05-07T20:33:11.8302329Z self = 2025-05-07T20:33:11.8303148Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:11.8303664Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f99482e2520>} 2025-05-07T20:33:11.8304472Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:11.8304671Z context = 2025-05-07T20:33:11.8304676Z 2025-05-07T20:33:11.8304847Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:11.8305124Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:11.8305233Z module_map=module_map) 2025-05-07T20:33:11.8305404Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:11.8305504Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:11.8305584Z E ^ 2025-05-07T20:33:11.8305959Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:11.8305968Z 2025-05-07T20:33:11.8306403Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:11.8306407Z 2025-05-07T20:33:11.8306520Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.8306752Z self=, 2025-05-07T20:33:11.8306833Z T=1, 2025-05-07T20:33:11.8306919Z D=5120, 2025-05-07T20:33:11.8307002Z scale_ub=None, 2025-05-07T20:33:11.8307091Z contiguous=True, 2025-05-07T20:33:11.8307181Z compiled=True, 2025-05-07T20:33:11.8307256Z ) 2025-05-07T20:33:11.8307479Z self = 2025-05-07T20:33:11.8307654Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:11.8307659Z 2025-05-07T20:33:11.8307738Z @given( 2025-05-07T20:33:11.8307860Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.8307973Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.8308091Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.8308215Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.8308330Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.8308407Z ) 2025-05-07T20:33:11.8308669Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.8308765Z def test_silu_mul_quant( 2025-05-07T20:33:11.8308844Z self, 2025-05-07T20:33:11.8308930Z T: int, 2025-05-07T20:33:11.8309008Z D: int, 2025-05-07T20:33:11.8309110Z scale_ub: Optional[float], 2025-05-07T20:33:11.8309260Z contiguous: bool, 2025-05-07T20:33:11.8309349Z compiled: bool, 2025-05-07T20:33:11.8309429Z ) -> None: 2025-05-07T20:33:11.8309531Z torch.manual_seed(2025) 2025-05-07T20:33:11.8309606Z 2025-05-07T20:33:11.8309785Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.8309936Z 2025-05-07T20:33:11.8310033Z x_sign = torch.sign(x) 2025-05-07T20:33:11.8310165Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:11.8310260Z x = x_sign * x_clamp 2025-05-07T20:33:11.8310347Z x0 = x[:, :D] 2025-05-07T20:33:11.8310479Z x1 = x[:, D:] 2025-05-07T20:33:11.8310558Z 2025-05-07T20:33:11.8310648Z if contiguous: 2025-05-07T20:33:11.8310753Z x0 = x0.contiguous() 2025-05-07T20:33:11.8310849Z x1 = x1.contiguous() 2025-05-07T20:33:11.8310928Z 2025-05-07T20:33:11.8311030Z if scale_ub is not None: 2025-05-07T20:33:11.8311142Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:11.8311289Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:11.8311372Z ) 2025-05-07T20:33:11.8311456Z else: 2025-05-07T20:33:11.8311564Z scale_ub_tensor = None 2025-05-07T20:33:11.8311644Z 2025-05-07T20:33:11.8311781Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:11.8311883Z op = silu_mul_quant 2025-05-07T20:33:11.8311971Z if compiled: 2025-05-07T20:33:11.8312071Z op = torch.compile(op) 2025-05-07T20:33:11.8312184Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8312261Z 2025-05-07T20:33:11.8312354Z y_fp8, y_scale = fn() 2025-05-07T20:33:11.8312486Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:11.8312560Z 2025-05-07T20:33:11.8312704Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:11.8312809Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:11.8312917Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:11.8313045Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:11.8313186Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:11.8313261Z 2025-05-07T20:33:11.8313374Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:11.8313378Z 2025-05-07T20:33:11.8313478Z moe/activation_test.py:126: 2025-05-07T20:33:11.8313612Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8313726Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:11.8313866Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:11.8314464Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:11.8314566Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:11.8314946Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:11.8315186Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:11.8315573Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:11.8315851Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:11.8316246Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:11.8316417Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:11.8316782Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:11.8316861Z fn() 2025-05-07T20:33:11.8317284Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:11.8317422Z self.fn.run( 2025-05-07T20:33:11.8317778Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:11.8317879Z kernel = self.compile( 2025-05-07T20:33:11.8318378Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:11.8318559Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:11.8318702Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8318744Z 2025-05-07T20:33:11.8318956Z self = 2025-05-07T20:33:11.8319772Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:11.8320285Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f99484c7d80>} 2025-05-07T20:33:11.8321085Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:11.8321284Z context = 2025-05-07T20:33:11.8321288Z 2025-05-07T20:33:11.8321458Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:11.8321737Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:11.8321846Z module_map=module_map) 2025-05-07T20:33:11.8322009Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:11.8322119Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:11.8322199Z E ^ 2025-05-07T20:33:11.8322574Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:11.8322585Z 2025-05-07T20:33:11.8323021Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:11.8323030Z 2025-05-07T20:33:11.8323140Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.8323374Z self=, 2025-05-07T20:33:11.8323453Z T=2048, 2025-05-07T20:33:11.8323535Z D=5120, 2025-05-07T20:33:11.8323627Z scale_ub=None, 2025-05-07T20:33:11.8323716Z contiguous=True, 2025-05-07T20:33:11.8323801Z compiled=True, 2025-05-07T20:33:11.8323886Z ) 2025-05-07T20:33:11.8324112Z self = 2025-05-07T20:33:11.8324294Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:11.8324301Z 2025-05-07T20:33:11.8324381Z @given( 2025-05-07T20:33:11.8324506Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.8324613Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.8324730Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.8324853Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.8324976Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.8325052Z ) 2025-05-07T20:33:11.8325311Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.8325631Z def test_silu_mul_quant( 2025-05-07T20:33:11.8325742Z self, 2025-05-07T20:33:11.8325856Z T: int, 2025-05-07T20:33:11.8325962Z D: int, 2025-05-07T20:33:11.8326084Z scale_ub: Optional[float], 2025-05-07T20:33:11.8326182Z contiguous: bool, 2025-05-07T20:33:11.8326267Z compiled: bool, 2025-05-07T20:33:11.8326345Z ) -> None: 2025-05-07T20:33:11.8326557Z torch.manual_seed(2025) 2025-05-07T20:33:11.8326629Z 2025-05-07T20:33:11.8326801Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.8326883Z 2025-05-07T20:33:11.8326975Z x_sign = torch.sign(x) 2025-05-07T20:33:11.8327220Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:11.8327328Z x = x_sign * x_clamp 2025-05-07T20:33:11.8327411Z x0 = x[:, :D] 2025-05-07T20:33:11.8327501Z x1 = x[:, D:] 2025-05-07T20:33:11.8327576Z 2025-05-07T20:33:11.8327662Z if contiguous: 2025-05-07T20:33:11.8327825Z x0 = x0.contiguous() 2025-05-07T20:33:11.8327920Z x1 = x1.contiguous() 2025-05-07T20:33:11.8327996Z 2025-05-07T20:33:11.8328098Z if scale_ub is not None: 2025-05-07T20:33:11.8328207Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:11.8328346Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:11.8328429Z ) 2025-05-07T20:33:11.8328512Z else: 2025-05-07T20:33:11.8328611Z scale_ub_tensor = None 2025-05-07T20:33:11.8328688Z 2025-05-07T20:33:11.8328817Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:11.8328914Z op = silu_mul_quant 2025-05-07T20:33:11.8328998Z if compiled: 2025-05-07T20:33:11.8329104Z op = torch.compile(op) 2025-05-07T20:33:11.8329238Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8329320Z 2025-05-07T20:33:11.8329428Z y_fp8, y_scale = fn() 2025-05-07T20:33:11.8329558Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:11.8329630Z 2025-05-07T20:33:11.8329766Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:11.8329878Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:11.8329980Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:11.8330104Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:11.8330260Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:11.8330336Z 2025-05-07T20:33:11.8330448Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:11.8330452Z 2025-05-07T20:33:11.8330553Z moe/activation_test.py:126: 2025-05-07T20:33:11.8330691Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8330805Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:11.8330944Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:11.8331532Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:11.8331641Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:11.8332020Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:11.8332256Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:11.8332645Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:11.8332908Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:11.8333314Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:11.8333483Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:11.8333853Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:11.8333935Z fn() 2025-05-07T20:33:11.8334430Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:11.8334518Z self.fn.run( 2025-05-07T20:33:11.8334873Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:11.8335019Z kernel = self.compile( 2025-05-07T20:33:11.8335427Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:11.8335603Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:11.8335825Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8335830Z 2025-05-07T20:33:11.8336042Z self = 2025-05-07T20:33:11.8336853Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:11.8337416Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f994384f060>} 2025-05-07T20:33:11.8338207Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:11.8338411Z context = 2025-05-07T20:33:11.8338415Z 2025-05-07T20:33:11.8338584Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:11.8338854Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:11.8338970Z module_map=module_map) 2025-05-07T20:33:11.8339156Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:11.8339278Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:11.8339365Z E ^ 2025-05-07T20:33:11.8339731Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

[9 further Hypothesis examples omitted: each fails with the identical CompilationError from make_ir, ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"); only the drawn parameters and the first call to reach the Triton compiler differ:
  T=128,   D=5120, scale_ub=None,   contiguous=True,  compiled=True  -> ref_fn / _kernel_quantize_fp8_row
  T=4096,  D=5120, scale_ub=None,   contiguous=True,  compiled=True  -> ref_fn / _kernel_quantize_fp8_row
  T=16384, D=5120, scale_ub=None,   contiguous=True,  compiled=True  -> ref_fn / _kernel_quantize_fp8_row
  T=1,     D=5120, scale_ub=1200.0, contiguous=True,  compiled=True  -> fn / _fbgemm_silu_mul_quant
  T=1,     D=5120, scale_ub=None,   contiguous=False, compiled=True  -> ref_fn / _kernel_quantize_fp8_row
  T=1,     D=5120, scale_ub=None,   contiguous=True,  compiled=False -> fn / _fbgemm_silu_mul_quant
  T=128,   D=5120, scale_ub=None,   contiguous=False, compiled=True  -> fn / _fbgemm_silu_mul_quant
  T=128,   D=7168, scale_ub=1200.0, contiguous=False, compiled=False -> fn / _fbgemm_silu_mul_quant
  T=128,   D=5120, scale_ub=None,   contiguous=False, compiled=False -> fn / _fbgemm_silu_mul_quant]

Trying example: test_silu_mul_quant(
self=,
T=128,
D=5120,
scale_ub=1200.0,
contiguous=True,
compiled=False,
)
self =
T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False

[test body identical to the listings above]

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
_fbgemm_silu_mul_quant[grid](
[... Triton compile traceback identical to the one above, ending in the same error:
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError ...]

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
[... test source and failing call identical to the listing above; compiled=True adds one frame, torch/_dynamo/eval_frame.py:678: in _fn, before entering silu_mul_quant, and the run fails with the same CompilationError ...]
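Every example above fails at the same point: fp8e4nv is Triton's name for the FP8 E4M3 format in its NVIDIA backend, and this GPU's architecture cannot compile it; Triton offers only fp8e4b15 and fp8e5 here, which is exactly what the ValueError reports. A minimal sketch of a hardware gate for such tests follows; the compute-capability threshold of (8, 9) (Ada/Hopper) and the skipIf placement are illustrative assumptions, not code from this repository:

    import unittest

    import torch

    def supports_fp8_e4m3() -> bool:
        # Assumption: Triton's fp8e4nv (FP8 E4M3) codegen needs compute capability >= 8.9.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not supports_fp8_e4m3(), "fp8e4nv unsupported on this GPU")
    class SiluMulQuantTest(unittest.TestCase):
        ...  # e.g. the test_silu_mul_quant shown in this log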
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
[... test source and compile traceback identical to the above, ending in ...]
E   ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:11.8530766Z 2025-05-07T20:33:11.8531203Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:11.8531210Z 2025-05-07T20:33:11.8531323Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.8531553Z self=, 2025-05-07T20:33:11.8531634Z T=1, 2025-05-07T20:33:11.8531718Z D=7168, 2025-05-07T20:33:11.8531808Z scale_ub=None, 2025-05-07T20:33:11.8531898Z contiguous=False, 2025-05-07T20:33:11.8531992Z compiled=True, 2025-05-07T20:33:11.8532071Z ) 2025-05-07T20:33:11.8532304Z self = 2025-05-07T20:33:11.8532477Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:11.8532482Z 2025-05-07T20:33:11.8532562Z @given( 2025-05-07T20:33:11.8532690Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.8532793Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.8532909Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.8533036Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.8533153Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.8533225Z ) 2025-05-07T20:33:11.8533490Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.8533585Z def test_silu_mul_quant( 2025-05-07T20:33:11.8533676Z self, 2025-05-07T20:33:11.8533758Z T: int, 2025-05-07T20:33:11.8533835Z D: int, 2025-05-07T20:33:11.8533946Z scale_ub: Optional[float], 2025-05-07T20:33:11.8534034Z contiguous: bool, 2025-05-07T20:33:11.8534120Z compiled: bool, 2025-05-07T20:33:11.8534206Z ) -> None: 2025-05-07T20:33:11.8534301Z torch.manual_seed(2025) 2025-05-07T20:33:11.8534513Z 2025-05-07T20:33:11.8534694Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.8534768Z 2025-05-07T20:33:11.8534860Z x_sign = torch.sign(x) 2025-05-07T20:33:11.8534995Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:11.8535088Z x = x_sign * x_clamp 2025-05-07T20:33:11.8535177Z x0 = x[:, :D] 2025-05-07T20:33:11.8535256Z x1 = x[:, D:] 2025-05-07T20:33:11.8535333Z 2025-05-07T20:33:11.8535428Z if contiguous: 2025-05-07T20:33:11.8535521Z x0 = x0.contiguous() 2025-05-07T20:33:11.8535617Z x1 = x1.contiguous() 2025-05-07T20:33:11.8535697Z 2025-05-07T20:33:11.8535788Z if scale_ub is not None: 2025-05-07T20:33:11.8535893Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:11.8536037Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:11.8536113Z ) 2025-05-07T20:33:11.8536191Z else: 2025-05-07T20:33:11.8536296Z scale_ub_tensor = None 2025-05-07T20:33:11.8536368Z 2025-05-07T20:33:11.8536497Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:11.8536596Z op = silu_mul_quant 2025-05-07T20:33:11.8536679Z if compiled: 2025-05-07T20:33:11.8536841Z op = torch.compile(op) 2025-05-07T20:33:11.8536948Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8537018Z 2025-05-07T20:33:11.8537117Z y_fp8, y_scale = fn() 2025-05-07T20:33:11.8537239Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:11.8537391Z 2025-05-07T20:33:11.8537539Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:11.8537644Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:11.8537743Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:11.8537910Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:11.8538052Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:11.8538135Z 2025-05-07T20:33:11.8538235Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:11.8538240Z 2025-05-07T20:33:11.8538339Z moe/activation_test.py:126: 2025-05-07T20:33:11.8538479Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8538591Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:11.8538727Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:11.8539394Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:11.8539499Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:11.8539886Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:11.8540122Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:11.8540512Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:11.8540789Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:11.8541186Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:11.8541371Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:11.8541736Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:11.8541821Z fn() 2025-05-07T20:33:11.8542254Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:11.8542340Z self.fn.run( 2025-05-07T20:33:11.8542700Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:11.8542804Z kernel = self.compile( 2025-05-07T20:33:11.8543207Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:11.8543399Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:11.8543534Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8543539Z 2025-05-07T20:33:11.8543748Z self = 2025-05-07T20:33:11.8544574Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:11.8545092Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f985736aca0>} 2025-05-07T20:33:11.8545898Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:11.8546100Z context = 2025-05-07T20:33:11.8546150Z 2025-05-07T20:33:11.8546321Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:11.8546609Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:11.8546791Z module_map=module_map) 2025-05-07T20:33:11.8546966Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:11.8547072Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:11.8547155Z E ^ 2025-05-07T20:33:11.8547535Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
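This scale_ub=None, compiled=True example is the one case above that made it through fn(): the failure moved into the test's own reference path, because triton_quantize_fp8_row also JIT-compiles a Triton kernel (_kernel_quantize_fp8_row) inside the autotuner's do_bench loop and hits the same fp8e4nv limit there. For orientation, a rough pure-PyTorch sketch of row-wise FP8 quantization that is consistent with the dequantization the test performs (y = y_fp8.to(torch.float32) * y_scale[:, None]); the eps and clamping details are assumptions, not FBGEMM's exact kernel:

    import torch

    def quantize_fp8_row_sketch(x: torch.Tensor, scale_ub: torch.Tensor | None = None):
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
        row_max = x.abs().amax(dim=-1).float()          # per-row absolute maximum
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # optional upper bound on the scale
        scale = row_max.clamp(min=1e-12) / fp8_max      # per-row dequantization scale
        xq = (x.float() / scale[:, None]).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
        return xq, scale

The scale is defined so that dequantization is a single multiply per row, matching the check the test performs on y_fp8 and y_scale.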
[... seven more examples failed with the identical fp8e4nv CompilationError while compiling _fbgemm_silu_mul_quant; test source listings and tracebacks elided:
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True) ...]
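The failure should reproduce without Hypothesis or pytest; a standalone sketch mirroring one failing example (the import path is read off the traceback above, and passing None for scale_ub follows the test code; both are assumptions, not verified against FBGEMM's documented API):

    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    T, D = 128, 5120
    x = torch.randn(T, 2 * D, device="cuda", dtype=torch.bfloat16)
    # On a GPU without fp8e4nv support this raises triton.compiler.errors.CompilationError:
    # ValueError("type fp8e4nv not supported in this architecture. ...")
    y_fp8, y_scale = silu_mul_quant(x[:, :D], x[:, D:], None)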
2025-05-07T20:33:11.8644848Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:11.8644951Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:11.8645331Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:11.8645578Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:11.8645939Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:11.8646037Z kernel = self.compile( 2025-05-07T20:33:11.8646458Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:11.8646694Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:11.8646836Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8646840Z 2025-05-07T20:33:11.8647138Z self = 2025-05-07T20:33:11.8647956Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:11.8648532Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f994946ae80>} 2025-05-07T20:33:11.8649328Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:11.8649537Z context = 2025-05-07T20:33:11.8649542Z 2025-05-07T20:33:11.8649715Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:11.8650004Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:11.8650119Z module_map=module_map) 2025-05-07T20:33:11.8650286Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:11.8650405Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:11.8650489Z E ^ 2025-05-07T20:33:11.8650861Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:11.8650866Z 2025-05-07T20:33:11.8651316Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:11.8651323Z 2025-05-07T20:33:11.8651430Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.8651670Z self=, 2025-05-07T20:33:11.8651753Z T=4096, 2025-05-07T20:33:11.8651834Z D=5120, 2025-05-07T20:33:11.8651929Z scale_ub=None, 2025-05-07T20:33:11.8652027Z contiguous=False, 2025-05-07T20:33:11.8652112Z compiled=True, 2025-05-07T20:33:11.8652196Z ) 2025-05-07T20:33:11.8652422Z self = 2025-05-07T20:33:11.8652607Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:11.8652620Z 2025-05-07T20:33:11.8652701Z @given( 2025-05-07T20:33:11.8652823Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.8652933Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.8653051Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.8653173Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.8653303Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.8653380Z ) 2025-05-07T20:33:11.8653634Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.8653741Z def test_silu_mul_quant( 2025-05-07T20:33:11.8653826Z self, 2025-05-07T20:33:11.8653907Z T: int, 2025-05-07T20:33:11.8653993Z D: int, 2025-05-07T20:33:11.8654094Z scale_ub: Optional[float], 2025-05-07T20:33:11.8654195Z contiguous: bool, 2025-05-07T20:33:11.8654288Z compiled: bool, 2025-05-07T20:33:11.8654554Z ) -> None: 2025-05-07T20:33:11.8654659Z torch.manual_seed(2025) 2025-05-07T20:33:11.8654736Z 2025-05-07T20:33:11.8654909Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.8654996Z 2025-05-07T20:33:11.8655091Z x_sign = torch.sign(x) 2025-05-07T20:33:11.8655222Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:11.8655378Z x = x_sign * x_clamp 2025-05-07T20:33:11.8655464Z x0 = x[:, :D] 2025-05-07T20:33:11.8655548Z x1 = x[:, D:] 2025-05-07T20:33:11.8655632Z 2025-05-07T20:33:11.8655719Z if contiguous: 2025-05-07T20:33:11.8655824Z x0 = x0.contiguous() 2025-05-07T20:33:11.8655999Z x1 = x1.contiguous() 2025-05-07T20:33:11.8656077Z 2025-05-07T20:33:11.8656180Z if scale_ub is not None: 2025-05-07T20:33:11.8656288Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:11.8656426Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:11.8656555Z ) 2025-05-07T20:33:11.8656636Z else: 2025-05-07T20:33:11.8656737Z scale_ub_tensor = None 2025-05-07T20:33:11.8656824Z 2025-05-07T20:33:11.8656958Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:11.8657053Z op = silu_mul_quant 2025-05-07T20:33:11.8657149Z if compiled: 2025-05-07T20:33:11.8657256Z op = torch.compile(op) 2025-05-07T20:33:11.8657368Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8657453Z 2025-05-07T20:33:11.8657550Z > y_fp8, y_scale = fn() 2025-05-07T20:33:11.8657554Z 2025-05-07T20:33:11.8657662Z moe/activation_test.py:117: 2025-05-07T20:33:11.8657804Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8657911Z moe/activation_test.py:115: in fn 2025-05-07T20:33:11.8658024Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8658416Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:11.8658515Z return fn(*args, **kwargs) 
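[Note: every failure in this section is the same environmental issue, not a kernel bug. fp8e4nv is Triton's name for the e4m3 FP8 format (PyTorch's torch.float8_e4m3fn), and Triton's NVIDIA backend only lowers it on GPUs with compute capability 8.9 or newer (Ada/Hopper). The A10G on this linux.g5.4xlarge runner is SM 8.6, where only fp8e4b15 and fp8e5 are available, which is exactly what the ValueError reports. Below is a minimal sketch of a capability gate that would skip these cases instead of erroring inside the Triton compiler; the helper name has_fp8e4nv_support, the class name, and the skip wiring are hypothetical, not FBGEMM's actual gating.]

import unittest

import torch


def has_fp8e4nv_support() -> bool:
    """True when Triton can lower fp8e4nv (torch.float8_e4m3fn) on this GPU."""
    # Triton's NVIDIA backend requires compute capability >= 8.9 (Ada/Hopper)
    # for fp8e4nv; the A10G (SM 8.6) on this runner fails the check, matching
    # the ValueError above.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipIf(not has_fp8e4nv_support(), "fp8e4nv requires SM >= 8.9")
class Fp8ActivationTests(unittest.TestCase):
    # Stand-in for the real test class; the decorator is the point: on
    # pre-SM-8.9 runners the case is reported as skipped, not failed.
    def test_silu_mul_quant(self) -> None:
        ...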
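[Note: for debugging outside Hypothesis, the failure reproduces with a single direct call. This sketch assumes only what the traceback already shows — the silu_mul_quant import path and its (x0, x1, scale_ub) calling convention — and pins the smallest example tried below (T=1, D=7168); it is a hypothetical standalone script, not part of the test suite.]

import torch

from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

# Fixed parameters instead of Hypothesis-driven ones.
T, D = 1, 7168
x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
x0 = x[:, :D].contiguous()
x1 = x[:, D:].contiguous()

# On SM < 8.9 this raises the CompilationError above as soon as Triton
# compiles _fbgemm_silu_mul_quant; on SM >= 8.9 it returns the quantized
# fp8 tensor and its scale.
y_fp8, y_scale = silu_mul_quant(x0, x1, None)
print(y_fp8.dtype, y_scale.shape)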
[Hypothesis then tried eleven more parameter combinations. Each run re-printed the same test body and the same Triton traceback shown above and failed with the identical CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"). The duplicated tracebacks are elided; the examples tried were:]

2025-05-07T20:33:11.8651430Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:11.8665711Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:33:11.8679281Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:11.8693424Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:33:11.8706838Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:33:11.8720319Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:11.8734802Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:33:11.8747705Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:11.8767637Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:11.8781657Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)

[The last example of this chunk is still in flight when the log continues below; its test body, identical to the one shown above, is elided:]

2025-05-07T20:33:11.8795081Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:11.8795322Z     self=,
2025-05-07T20:33:11.8795411Z     T=16384,
2025-05-07T20:33:11.8795494Z     D=5120,
2025-05-07T20:33:11.8795591Z     scale_ub=1200.0,
2025-05-07T20:33:11.8795680Z     contiguous=True,
2025-05-07T20:33:11.8795766Z     compiled=True,
2025-05-07T20:33:11.8795855Z )
2025-05-07T20:33:11.8796083Z self = 
2025-05-07T20:33:11.8796277Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:33:11.8801150Z >       y_fp8, y_scale = fn()
2025-05-07T20:33:11.8801158Z 
2025-05-07T20:33:11.8801268Z moe/activation_test.py:117: 
2025-05-07T20:33:11.8801404Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:11.8801513Z moe/activation_test.py:115: in fn
2025-05-07T20:33:11.8801630Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:11.8802018Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:33:11.8802128Z     return fn(*args, **kwargs)
2025-05-07T20:33:11.8802650Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:11.8802755Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:11.8803136Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:11.8803367Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:11.8803729Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:11.8803837Z kernel = self.compile( 2025-05-07T20:33:11.8804241Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:11.8804433Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:11.8804566Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8804570Z 2025-05-07T20:33:11.8804781Z self = 2025-05-07T20:33:11.8805602Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:11.8806118Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9943fc6ca0>} 2025-05-07T20:33:11.8806918Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:11.8807171Z context = 2025-05-07T20:33:11.8807176Z 2025-05-07T20:33:11.8807357Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:11.8807633Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:11.8807823Z module_map=module_map) 2025-05-07T20:33:11.8808001Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:11.8808108Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:11.8808193Z E ^ 2025-05-07T20:33:11.8808618Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:11.8808622Z 2025-05-07T20:33:11.8809058Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:11.8809062Z 2025-05-07T20:33:11.8809180Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.8809418Z self=, 2025-05-07T20:33:11.8809501Z T=16384, 2025-05-07T20:33:11.8809591Z D=5120, 2025-05-07T20:33:11.8809677Z scale_ub=None, 2025-05-07T20:33:11.8809768Z contiguous=False, 2025-05-07T20:33:11.8809861Z compiled=True, 2025-05-07T20:33:11.8809943Z ) 2025-05-07T20:33:11.8810171Z self = 2025-05-07T20:33:11.8810361Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:11.8810369Z 2025-05-07T20:33:11.8810450Z @given( 2025-05-07T20:33:11.8810580Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.8810683Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.8810802Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.8810930Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.8811048Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.8811128Z ) 2025-05-07T20:33:11.8811391Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.8811491Z def test_silu_mul_quant( 2025-05-07T20:33:11.8811572Z self, 2025-05-07T20:33:11.8811661Z T: int, 2025-05-07T20:33:11.8811746Z D: int, 2025-05-07T20:33:11.8811857Z scale_ub: Optional[float], 2025-05-07T20:33:11.8811951Z contiguous: bool, 2025-05-07T20:33:11.8812040Z compiled: bool, 2025-05-07T20:33:11.8812134Z ) -> None: 2025-05-07T20:33:11.8812234Z torch.manual_seed(2025) 2025-05-07T20:33:11.8812313Z 2025-05-07T20:33:11.8812495Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.8812574Z 2025-05-07T20:33:11.8812673Z x_sign = torch.sign(x) 2025-05-07T20:33:11.8812809Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:11.8812904Z x = x_sign * x_clamp 2025-05-07T20:33:11.8812992Z x0 = x[:, :D] 2025-05-07T20:33:11.8813088Z x1 = x[:, D:] 2025-05-07T20:33:11.8813165Z 2025-05-07T20:33:11.8813256Z if contiguous: 2025-05-07T20:33:11.8813361Z x0 = x0.contiguous() 2025-05-07T20:33:11.8813453Z x1 = x1.contiguous() 2025-05-07T20:33:11.8813538Z 2025-05-07T20:33:11.8813639Z if scale_ub is not None: 2025-05-07T20:33:11.8813750Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:11.8813896Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:11.8813976Z ) 2025-05-07T20:33:11.8814059Z else: 2025-05-07T20:33:11.8814165Z scale_ub_tensor = None 2025-05-07T20:33:11.8814241Z 2025-05-07T20:33:11.8814527Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:11.8814634Z op = silu_mul_quant 2025-05-07T20:33:11.8814722Z if compiled: 2025-05-07T20:33:11.8814824Z op = torch.compile(op) 2025-05-07T20:33:11.8814993Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8815068Z 2025-05-07T20:33:11.8815171Z > y_fp8, y_scale = fn() 2025-05-07T20:33:11.8815175Z 2025-05-07T20:33:11.8815276Z moe/activation_test.py:117: 2025-05-07T20:33:11.8815410Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8815608Z moe/activation_test.py:115: in fn 2025-05-07T20:33:11.8815714Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8816104Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:11.8816251Z return fn(*args, **kwargs) 
2025-05-07T20:33:11.8816776Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:11.8816885Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:11.8817265Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:11.8817501Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:11.8817869Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:11.8817968Z kernel = self.compile( 2025-05-07T20:33:11.8818377Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:11.8818566Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:11.8818706Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8818710Z 2025-05-07T20:33:11.8818926Z self = 2025-05-07T20:33:11.8819735Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:11.8820261Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9857b74b80>} 2025-05-07T20:33:11.8821056Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:11.8821255Z context = 2025-05-07T20:33:11.8821262Z 2025-05-07T20:33:11.8821441Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:11.8821714Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:11.8821831Z module_map=module_map) 2025-05-07T20:33:11.8821995Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:11.8822102Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:11.8822191Z E ^ 2025-05-07T20:33:11.8822562Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:11.8822567Z 2025-05-07T20:33:11.8823010Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:11.8823021Z 2025-05-07T20:33:11.8823134Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.8823362Z self=, 2025-05-07T20:33:11.8823455Z T=2048, 2025-05-07T20:33:11.8823536Z D=5120, 2025-05-07T20:33:11.8823619Z scale_ub=None, 2025-05-07T20:33:11.8823721Z contiguous=False, 2025-05-07T20:33:11.8823809Z compiled=True, 2025-05-07T20:33:11.8823886Z ) 2025-05-07T20:33:11.8824119Z self = 2025-05-07T20:33:11.8824349Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:11.8824354Z 2025-05-07T20:33:11.8824444Z @given( 2025-05-07T20:33:11.8824565Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.8824673Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.8824880Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.8825002Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.8825120Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.8825207Z ) 2025-05-07T20:33:11.8826381Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.8826749Z def test_silu_mul_quant( 2025-05-07T20:33:11.8826842Z self, 2025-05-07T20:33:11.8826928Z T: int, 2025-05-07T20:33:11.8827011Z D: int, 2025-05-07T20:33:11.8827131Z scale_ub: Optional[float], 2025-05-07T20:33:11.8827228Z contiguous: bool, 2025-05-07T20:33:11.8827354Z compiled: bool, 2025-05-07T20:33:11.8827451Z ) -> None: 2025-05-07T20:33:11.8827553Z torch.manual_seed(2025) 2025-05-07T20:33:11.8827630Z 2025-05-07T20:33:11.8827829Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.8827907Z 2025-05-07T20:33:11.8828011Z x_sign = torch.sign(x) 2025-05-07T20:33:11.8828146Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:11.8828254Z x = x_sign * x_clamp 2025-05-07T20:33:11.8828335Z x0 = x[:, :D] 2025-05-07T20:33:11.8828421Z x1 = x[:, D:] 2025-05-07T20:33:11.8828504Z 2025-05-07T20:33:11.8828588Z if contiguous: 2025-05-07T20:33:11.8828682Z x0 = x0.contiguous() 2025-05-07T20:33:11.8828779Z x1 = x1.contiguous() 2025-05-07T20:33:11.8828852Z 2025-05-07T20:33:11.8828942Z if scale_ub is not None: 2025-05-07T20:33:11.8829057Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:11.8829197Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:11.8829275Z ) 2025-05-07T20:33:11.8829360Z else: 2025-05-07T20:33:11.8829455Z scale_ub_tensor = None 2025-05-07T20:33:11.8829525Z 2025-05-07T20:33:11.8829669Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:11.8829766Z op = silu_mul_quant 2025-05-07T20:33:11.8829856Z if compiled: 2025-05-07T20:33:11.8829959Z op = torch.compile(op) 2025-05-07T20:33:11.8830065Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8830151Z 2025-05-07T20:33:11.8830242Z > y_fp8, y_scale = fn() 2025-05-07T20:33:11.8830248Z 2025-05-07T20:33:11.8830346Z moe/activation_test.py:117: 2025-05-07T20:33:11.8830490Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8830592Z moe/activation_test.py:115: in fn 2025-05-07T20:33:11.8830697Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8831102Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:11.8831198Z return fn(*args, **kwargs) 
2025-05-07T20:33:11.8831727Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:11.8831828Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:11.8832204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:11.8832443Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:11.8832800Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:11.8832902Z kernel = self.compile( 2025-05-07T20:33:11.8833304Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:11.8833789Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:11.8833928Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8833934Z 2025-05-07T20:33:11.8834139Z self = 2025-05-07T20:33:11.8835111Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:11.8835707Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9857b760c0>} 2025-05-07T20:33:11.8836504Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:11.8836709Z context = 2025-05-07T20:33:11.8836714Z 2025-05-07T20:33:11.8836882Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:11.8837166Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:11.8837272Z module_map=module_map) 2025-05-07T20:33:11.8837436Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:11.8837542Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:11.8837622Z E ^ 2025-05-07T20:33:11.8837995Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:11.8838007Z 2025-05-07T20:33:11.8838444Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:11.8838450Z 2025-05-07T20:33:11.8838556Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.8838792Z self=, 2025-05-07T20:33:11.8838875Z T=2048, 2025-05-07T20:33:11.8838953Z D=5120, 2025-05-07T20:33:11.8839044Z scale_ub=1200.0, 2025-05-07T20:33:11.8839137Z contiguous=False, 2025-05-07T20:33:11.8839223Z compiled=True, 2025-05-07T20:33:11.8839323Z ) 2025-05-07T20:33:11.8839640Z self = 2025-05-07T20:33:11.8839865Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:11.8839874Z 2025-05-07T20:33:11.8839953Z @given( 2025-05-07T20:33:11.8840076Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.8840187Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.8840304Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.8840423Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.8840549Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.8840630Z ) 2025-05-07T20:33:11.8840885Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.8840989Z def test_silu_mul_quant( 2025-05-07T20:33:11.8841068Z self, 2025-05-07T20:33:11.8841160Z T: int, 2025-05-07T20:33:11.8841241Z D: int, 2025-05-07T20:33:11.8841344Z scale_ub: Optional[float], 2025-05-07T20:33:11.8841443Z contiguous: bool, 2025-05-07T20:33:11.8841531Z compiled: bool, 2025-05-07T20:33:11.8841613Z ) -> None: 2025-05-07T20:33:11.8841718Z torch.manual_seed(2025) 2025-05-07T20:33:11.8841793Z 2025-05-07T20:33:11.8841966Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.8842054Z 2025-05-07T20:33:11.8842151Z x_sign = torch.sign(x) 2025-05-07T20:33:11.8842277Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:11.8842437Z x = x_sign * x_clamp 2025-05-07T20:33:11.8842519Z x0 = x[:, :D] 2025-05-07T20:33:11.8842609Z x1 = x[:, D:] 2025-05-07T20:33:11.8842684Z 2025-05-07T20:33:11.8842773Z if contiguous: 2025-05-07T20:33:11.8842873Z x0 = x0.contiguous() 2025-05-07T20:33:11.8843042Z x1 = x1.contiguous() 2025-05-07T20:33:11.8843120Z 2025-05-07T20:33:11.8843224Z if scale_ub is not None: 2025-05-07T20:33:11.8843332Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:11.8843469Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:11.8843601Z ) 2025-05-07T20:33:11.8843679Z else: 2025-05-07T20:33:11.8843776Z scale_ub_tensor = None 2025-05-07T20:33:11.8843858Z 2025-05-07T20:33:11.8843989Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:11.8844078Z op = silu_mul_quant 2025-05-07T20:33:11.8844168Z if compiled: 2025-05-07T20:33:11.8844271Z op = torch.compile(op) 2025-05-07T20:33:11.8844385Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8844462Z 2025-05-07T20:33:11.8844554Z > y_fp8, y_scale = fn() 2025-05-07T20:33:11.8844558Z 2025-05-07T20:33:11.8844660Z moe/activation_test.py:117: 2025-05-07T20:33:11.8844796Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8844897Z moe/activation_test.py:115: in fn 2025-05-07T20:33:11.8845009Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8845394Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:11.8845496Z return fn(*args, **kwargs) 
2025-05-07T20:33:11.8846016Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:11.8846114Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:11.8846494Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:11.8846725Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:11.8847086Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:11.8847188Z kernel = self.compile( 2025-05-07T20:33:11.8847590Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:11.8847772Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:11.8847905Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8847910Z 2025-05-07T20:33:11.8848117Z self = 2025-05-07T20:33:11.8848937Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:11.8849451Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9857b772e0>} 2025-05-07T20:33:11.8850256Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:11.8850450Z context = 2025-05-07T20:33:11.8850455Z 2025-05-07T20:33:11.8850630Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:11.8850902Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:11.8851008Z module_map=module_map) 2025-05-07T20:33:11.8851227Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:11.8851326Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:11.8851401Z E ^ 2025-05-07T20:33:11.8851779Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:11.8851861Z 2025-05-07T20:33:11.8852300Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:11.8852305Z 2025-05-07T20:33:11.8852417Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.8852684Z self=, 2025-05-07T20:33:11.8852765Z T=4096, 2025-05-07T20:33:11.8852851Z D=5120, 2025-05-07T20:33:11.8852940Z scale_ub=1200.0, 2025-05-07T20:33:11.8853026Z contiguous=True, 2025-05-07T20:33:11.8853121Z compiled=True, 2025-05-07T20:33:11.8853198Z ) 2025-05-07T20:33:11.8853426Z self = 2025-05-07T20:33:11.8853614Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:11.8853619Z 2025-05-07T20:33:11.8853698Z @given( 2025-05-07T20:33:11.8853824Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.8853934Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.8854053Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.8854181Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.8854296Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.8854482Z ) 2025-05-07T20:33:11.8854746Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.8854843Z def test_silu_mul_quant( 2025-05-07T20:33:11.8854924Z self, 2025-05-07T20:33:11.8855014Z T: int, 2025-05-07T20:33:11.8855093Z D: int, 2025-05-07T20:33:11.8855206Z scale_ub: Optional[float], 2025-05-07T20:33:11.8855303Z contiguous: bool, 2025-05-07T20:33:11.8855392Z compiled: bool, 2025-05-07T20:33:11.8855478Z ) -> None: 2025-05-07T20:33:11.8855572Z torch.manual_seed(2025) 2025-05-07T20:33:11.8855644Z 2025-05-07T20:33:11.8855834Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.8855910Z 2025-05-07T20:33:11.8856006Z x_sign = torch.sign(x) 2025-05-07T20:33:11.8856142Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:11.8856230Z x = x_sign * x_clamp 2025-05-07T20:33:11.8856317Z x0 = x[:, :D] 2025-05-07T20:33:11.8856404Z x1 = x[:, D:] 2025-05-07T20:33:11.8856477Z 2025-05-07T20:33:11.8856572Z if contiguous: 2025-05-07T20:33:11.8856664Z x0 = x0.contiguous() 2025-05-07T20:33:11.8856753Z x1 = x1.contiguous() 2025-05-07T20:33:11.8856836Z 2025-05-07T20:33:11.8856928Z if scale_ub is not None: 2025-05-07T20:33:11.8857038Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:11.8857184Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:11.8857262Z ) 2025-05-07T20:33:11.8857341Z else: 2025-05-07T20:33:11.8857446Z scale_ub_tensor = None 2025-05-07T20:33:11.8857521Z 2025-05-07T20:33:11.8857657Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:11.8857759Z op = silu_mul_quant 2025-05-07T20:33:11.8857847Z if compiled: 2025-05-07T20:33:11.8857948Z op = torch.compile(op) 2025-05-07T20:33:11.8858067Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8858143Z 2025-05-07T20:33:11.8858242Z > y_fp8, y_scale = fn() 2025-05-07T20:33:11.8858247Z 2025-05-07T20:33:11.8858343Z moe/activation_test.py:117: 2025-05-07T20:33:11.8858478Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8858588Z moe/activation_test.py:115: in fn 2025-05-07T20:33:11.8858746Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8859137Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:11.8859242Z return fn(*args, **kwargs) 
2025-05-07T20:33:11.8859933Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:11.8860046Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:11.8860424Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:11.8860694Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:11.8861060Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:11.8861158Z kernel = self.compile( 2025-05-07T20:33:11.8861564Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:11.8861758Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:11.8861892Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8861896Z 2025-05-07T20:33:11.8862122Z self = 2025-05-07T20:33:11.8862935Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:11.8863459Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f98570dc860>} 2025-05-07T20:33:11.8864252Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:11.8864451Z context = 2025-05-07T20:33:11.8864455Z 2025-05-07T20:33:11.8864634Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:11.8864914Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:11.8865032Z module_map=module_map) 2025-05-07T20:33:11.8865197Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:11.8865302Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:11.8865384Z E ^ 2025-05-07T20:33:11.8865757Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:11.8865762Z 2025-05-07T20:33:11.8866201Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:11.8866214Z 2025-05-07T20:33:11.8866323Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.8866559Z self=, 2025-05-07T20:33:11.8866651Z T=128, 2025-05-07T20:33:11.8866735Z D=5120, 2025-05-07T20:33:11.8866832Z scale_ub=1200.0, 2025-05-07T20:33:11.8866933Z contiguous=False, 2025-05-07T20:33:11.8867020Z compiled=True, 2025-05-07T20:33:11.8867101Z ) 2025-05-07T20:33:11.8867336Z self = 2025-05-07T20:33:11.8867518Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:11.8867522Z 2025-05-07T20:33:11.8867610Z @given( 2025-05-07T20:33:11.8867732Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.8867839Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.8867967Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.8868136Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.8868255Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.8868341Z ) 2025-05-07T20:33:11.8868597Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.8868768Z def test_silu_mul_quant( 2025-05-07T20:33:11.8868860Z self, 2025-05-07T20:33:11.8868943Z T: int, 2025-05-07T20:33:11.8869026Z D: int, 2025-05-07T20:33:11.8869138Z scale_ub: Optional[float], 2025-05-07T20:33:11.8869231Z contiguous: bool, 2025-05-07T20:33:11.8869370Z compiled: bool, 2025-05-07T20:33:11.8869450Z ) -> None: 2025-05-07T20:33:11.8869547Z torch.manual_seed(2025) 2025-05-07T20:33:11.8869630Z 2025-05-07T20:33:11.8869802Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.8869879Z 2025-05-07T20:33:11.8869979Z x_sign = torch.sign(x) 2025-05-07T20:33:11.8870110Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:11.8870201Z x = x_sign * x_clamp 2025-05-07T20:33:11.8870291Z x0 = x[:, :D] 2025-05-07T20:33:11.8870370Z x1 = x[:, D:] 2025-05-07T20:33:11.8870445Z 2025-05-07T20:33:11.8870535Z if contiguous: 2025-05-07T20:33:11.8870633Z x0 = x0.contiguous() 2025-05-07T20:33:11.8870731Z x1 = x1.contiguous() 2025-05-07T20:33:11.8870804Z 2025-05-07T20:33:11.8870895Z if scale_ub is not None: 2025-05-07T20:33:11.8871009Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:11.8871152Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:11.8871231Z ) 2025-05-07T20:33:11.8871318Z else: 2025-05-07T20:33:11.8871417Z scale_ub_tensor = None 2025-05-07T20:33:11.8871487Z 2025-05-07T20:33:11.8871625Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:11.8871718Z op = silu_mul_quant 2025-05-07T20:33:11.8871806Z if compiled: 2025-05-07T20:33:11.8871914Z op = torch.compile(op) 2025-05-07T20:33:11.8872022Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8872100Z 2025-05-07T20:33:11.8872198Z > y_fp8, y_scale = fn() 2025-05-07T20:33:11.8872202Z 2025-05-07T20:33:11.8872308Z moe/activation_test.py:117: 2025-05-07T20:33:11.8872450Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8872553Z moe/activation_test.py:115: in fn 2025-05-07T20:33:11.8872657Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8873060Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:11.8873155Z return fn(*args, **kwargs) 
2025-05-07T20:33:11.8873678Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:11.8873789Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:11.8874168Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:11.8874405Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:11.8874768Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:11.8874865Z kernel = self.compile( 2025-05-07T20:33:11.8875277Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:11.8875459Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:11.8875603Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8875608Z 2025-05-07T20:33:11.8875818Z self = 2025-05-07T20:33:11.8876629Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:11.8877281Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f98570dd580>} 2025-05-07T20:33:11.8878079Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:11.8878324Z context = 2025-05-07T20:33:11.8878328Z 2025-05-07T20:33:11.8878502Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:11.8878783Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:11.8878905Z module_map=module_map) 2025-05-07T20:33:11.8879081Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:11.8879194Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:11.8879278Z E ^ 2025-05-07T20:33:11.8879660Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:11.8879665Z 2025-05-07T20:33:11.8880110Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:11.8880118Z 2025-05-07T20:33:11.8880228Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.8880469Z self=, 2025-05-07T20:33:11.8880554Z T=16384, 2025-05-07T20:33:11.8880637Z D=7168, 2025-05-07T20:33:11.8880731Z scale_ub=1200.0, 2025-05-07T20:33:11.8880819Z contiguous=True, 2025-05-07T20:33:11.8880908Z compiled=True, 2025-05-07T20:33:11.8880999Z ) 2025-05-07T20:33:11.8881228Z self = 2025-05-07T20:33:11.8881417Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:11.8881422Z 2025-05-07T20:33:11.8881514Z @given( 2025-05-07T20:33:11.8881638Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.8881747Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.8881867Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.8881985Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.8882113Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.8882187Z ) 2025-05-07T20:33:11.8882441Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.8882543Z def test_silu_mul_quant( 2025-05-07T20:33:11.8882620Z self, 2025-05-07T20:33:11.8882703Z T: int, 2025-05-07T20:33:11.8882793Z D: int, 2025-05-07T20:33:11.8882895Z scale_ub: Optional[float], 2025-05-07T20:33:11.8882989Z contiguous: bool, 2025-05-07T20:33:11.8883083Z compiled: bool, 2025-05-07T20:33:11.8883164Z ) -> None: 2025-05-07T20:33:11.8883265Z torch.manual_seed(2025) 2025-05-07T20:33:11.8883337Z 2025-05-07T20:33:11.8883513Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.8883595Z 2025-05-07T20:33:11.8883690Z x_sign = torch.sign(x) 2025-05-07T20:33:11.8883814Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:11.8883912Z x = x_sign * x_clamp 2025-05-07T20:33:11.8883993Z x0 = x[:, :D] 2025-05-07T20:33:11.8884074Z x1 = x[:, D:] 2025-05-07T20:33:11.8884157Z 2025-05-07T20:33:11.8884240Z if contiguous: 2025-05-07T20:33:11.8884329Z x0 = x0.contiguous() 2025-05-07T20:33:11.8884425Z x1 = x1.contiguous() 2025-05-07T20:33:11.8884499Z 2025-05-07T20:33:11.8884640Z if scale_ub is not None: 2025-05-07T20:33:11.8884758Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:11.8884894Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:11.8884978Z ) 2025-05-07T20:33:11.8885057Z else: 2025-05-07T20:33:11.8885231Z scale_ub_tensor = None 2025-05-07T20:33:11.8885313Z 2025-05-07T20:33:11.8885447Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:11.8885539Z op = silu_mul_quant 2025-05-07T20:33:11.8885631Z if compiled: 2025-05-07T20:33:11.8885772Z op = torch.compile(op) 2025-05-07T20:33:11.8885878Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8885960Z 2025-05-07T20:33:11.8886056Z > y_fp8, y_scale = fn() 2025-05-07T20:33:11.8886061Z 2025-05-07T20:33:11.8886166Z moe/activation_test.py:117: 2025-05-07T20:33:11.8886298Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8886403Z moe/activation_test.py:115: in fn 2025-05-07T20:33:11.8886510Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8886903Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:11.8886998Z return fn(*args, **kwargs) 
2025-05-07T20:33:11.8887531Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:11.8887630Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:11.8888017Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:11.8888246Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:11.8888604Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:11.8888713Z kernel = self.compile( 2025-05-07T20:33:11.8889119Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:11.8889298Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:11.8889440Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8889444Z 2025-05-07T20:33:11.8889682Z self = 2025-05-07T20:33:11.8890518Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:11.8891035Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f98570de0c0>} 2025-05-07T20:33:11.8891835Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:11.8892031Z context = 2025-05-07T20:33:11.8892035Z 2025-05-07T20:33:11.8892212Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:11.8892492Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:11.8892600Z module_map=module_map) 2025-05-07T20:33:11.8892781Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:11.8892881Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:11.8892962Z E ^ 2025-05-07T20:33:11.8899861Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:11.8899870Z 2025-05-07T20:33:11.8900340Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:11.8900418Z 2025-05-07T20:33:11.8900530Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.8900775Z self=, 2025-05-07T20:33:11.8900963Z T=16384, 2025-05-07T20:33:11.8901051Z D=5120, 2025-05-07T20:33:11.8901152Z scale_ub=1200.0, 2025-05-07T20:33:11.8901243Z contiguous=True, 2025-05-07T20:33:11.8901334Z compiled=False, 2025-05-07T20:33:11.8901425Z ) 2025-05-07T20:33:11.8901698Z self = 2025-05-07T20:33:11.8901895Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:11.8901900Z 2025-05-07T20:33:11.8901986Z @given( 2025-05-07T20:33:11.8902109Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.8902228Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.8902352Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.8902477Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.8902602Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.8902684Z ) 2025-05-07T20:33:11.8902947Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.8903055Z def test_silu_mul_quant( 2025-05-07T20:33:11.8903138Z self, 2025-05-07T20:33:11.8903230Z T: int, 2025-05-07T20:33:11.8903316Z D: int, 2025-05-07T20:33:11.8903419Z scale_ub: Optional[float], 2025-05-07T20:33:11.8903522Z contiguous: bool, 2025-05-07T20:33:11.8903613Z compiled: bool, 2025-05-07T20:33:11.8903696Z ) -> None: 2025-05-07T20:33:11.8903809Z torch.manual_seed(2025) 2025-05-07T20:33:11.8903891Z 2025-05-07T20:33:11.8904069Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.8904161Z 2025-05-07T20:33:11.8904265Z x_sign = torch.sign(x) 2025-05-07T20:33:11.8904396Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:11.8904498Z x = x_sign * x_clamp 2025-05-07T20:33:11.8904587Z x0 = x[:, :D] 2025-05-07T20:33:11.8904674Z x1 = x[:, D:] 2025-05-07T20:33:11.8904761Z 2025-05-07T20:33:11.8904857Z if contiguous: 2025-05-07T20:33:11.8904964Z x0 = x0.contiguous() 2025-05-07T20:33:11.8905061Z x1 = x1.contiguous() 2025-05-07T20:33:11.8905141Z 2025-05-07T20:33:11.8905245Z if scale_ub is not None: 2025-05-07T20:33:11.8905360Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:11.8905500Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:11.8905588Z ) 2025-05-07T20:33:11.8905670Z else: 2025-05-07T20:33:11.8905771Z scale_ub_tensor = None 2025-05-07T20:33:11.8905859Z 2025-05-07T20:33:11.8905992Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:11.8906089Z op = silu_mul_quant 2025-05-07T20:33:11.8906186Z if compiled: 2025-05-07T20:33:11.8906289Z op = torch.compile(op) 2025-05-07T20:33:11.8906405Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8906482Z 2025-05-07T20:33:11.8906582Z > y_fp8, y_scale = fn() 2025-05-07T20:33:11.8906586Z 2025-05-07T20:33:11.8906693Z moe/activation_test.py:117: 2025-05-07T20:33:11.8906824Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8906928Z moe/activation_test.py:115: in fn 2025-05-07T20:33:11.8907039Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8907564Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:11.8907674Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:11.8908052Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:11.8908334Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:11.8908701Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:11.8908876Z kernel = self.compile( 2025-05-07T20:33:11.8909286Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:11.8909478Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:11.8909652Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8909656Z 2025-05-07T20:33:11.8909878Z self = 2025-05-07T20:33:11.8910693Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:11.8911216Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f98570df1a0>} 2025-05-07T20:33:11.8914634Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:11.8914849Z context = 2025-05-07T20:33:11.8914858Z 2025-05-07T20:33:11.8915037Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:11.8915312Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:11.8915424Z module_map=module_map) 2025-05-07T20:33:11.8915597Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:11.8915703Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:11.8915781Z E ^ 2025-05-07T20:33:11.8916166Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:11.8916171Z 2025-05-07T20:33:11.8916616Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:11.8916620Z 2025-05-07T20:33:11.8916755Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.8916988Z self=, 2025-05-07T20:33:11.8917074Z T=1, 2025-05-07T20:33:11.8917165Z D=7168, 2025-05-07T20:33:11.8917250Z scale_ub=1200.0, 2025-05-07T20:33:11.8917336Z contiguous=False, 2025-05-07T20:33:11.8917432Z compiled=False, 2025-05-07T20:33:11.8917508Z ) 2025-05-07T20:33:11.8917735Z self = 2025-05-07T20:33:11.8917921Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:11.8917926Z 2025-05-07T20:33:11.8918007Z @given( 2025-05-07T20:33:11.8918138Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.8918241Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.8918364Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.8918491Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.8918608Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.8918690Z ) 2025-05-07T20:33:11.8918955Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.8919051Z def test_silu_mul_quant( 2025-05-07T20:33:11.8919130Z self, 2025-05-07T20:33:11.8919222Z T: int, 2025-05-07T20:33:11.8919301Z D: int, 2025-05-07T20:33:11.8919401Z scale_ub: Optional[float], 2025-05-07T20:33:11.8919499Z contiguous: bool, 2025-05-07T20:33:11.8919650Z compiled: bool, 2025-05-07T20:33:11.8919737Z ) -> None: 2025-05-07T20:33:11.8919841Z torch.manual_seed(2025) 2025-05-07T20:33:11.8919921Z 2025-05-07T20:33:11.8920110Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.8920193Z 2025-05-07T20:33:11.8920333Z x_sign = torch.sign(x) 2025-05-07T20:33:11.8920474Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:11.8920569Z x = x_sign * x_clamp 2025-05-07T20:33:11.8920656Z x0 = x[:, :D] 2025-05-07T20:33:11.8920795Z x1 = x[:, D:] 2025-05-07T20:33:11.8920874Z 2025-05-07T20:33:11.8920966Z if contiguous: 2025-05-07T20:33:11.8921071Z x0 = x0.contiguous() 2025-05-07T20:33:11.8921166Z x1 = x1.contiguous() 2025-05-07T20:33:11.8921249Z 2025-05-07T20:33:11.8921347Z if scale_ub is not None: 2025-05-07T20:33:11.8921463Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:11.8921616Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:11.8921697Z ) 2025-05-07T20:33:11.8921783Z else: 2025-05-07T20:33:11.8921890Z scale_ub_tensor = None 2025-05-07T20:33:11.8921971Z 2025-05-07T20:33:11.8922108Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:11.8922210Z op = silu_mul_quant 2025-05-07T20:33:11.8922295Z if compiled: 2025-05-07T20:33:11.8922480Z op = torch.compile(op) 2025-05-07T20:33:11.8922598Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8922678Z 2025-05-07T20:33:11.8922778Z > y_fp8, y_scale = fn() 2025-05-07T20:33:11.8922782Z 2025-05-07T20:33:11.8922884Z moe/activation_test.py:117: 2025-05-07T20:33:11.8923017Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8923126Z moe/activation_test.py:115: in fn 2025-05-07T20:33:11.8923227Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8923760Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:11.8923867Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:11.8924249Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:11.8924489Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:11.8924853Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:11.8924951Z kernel = self.compile( 2025-05-07T20:33:11.8925371Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:11.8926342Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:11.8926479Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8926494Z 2025-05-07T20:33:11.8926707Z self = 2025-05-07T20:33:11.8927523Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:11.8928052Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9856c30680>} 2025-05-07T20:33:11.8928854Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:11.8929060Z context = 2025-05-07T20:33:11.8929064Z 2025-05-07T20:33:11.8929335Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:11.8929612Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:11.8929733Z module_map=module_map) 2025-05-07T20:33:11.8929966Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:11.8930070Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:11.8930163Z E ^ 2025-05-07T20:33:11.8930543Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:11.8930605Z 2025-05-07T20:33:11.8931059Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:11.8931064Z 2025-05-07T20:33:11.8931173Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.8931408Z self=, 2025-05-07T20:33:11.8931507Z T=4096, 2025-05-07T20:33:11.8931590Z D=7168, 2025-05-07T20:33:11.8931692Z scale_ub=1200.0, 2025-05-07T20:33:11.8931788Z contiguous=False, 2025-05-07T20:33:11.8931877Z compiled=True, 2025-05-07T20:33:11.8931966Z ) 2025-05-07T20:33:11.8932199Z self = 2025-05-07T20:33:11.8932388Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:11.8932393Z 2025-05-07T20:33:11.8932488Z @given( 2025-05-07T20:33:11.8932712Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.8932823Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.8932956Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.8933079Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.8933210Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.8933293Z ) 2025-05-07T20:33:11.8933553Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.8933664Z def test_silu_mul_quant( 2025-05-07T20:33:11.8933749Z self, 2025-05-07T20:33:11.8933831Z T: int, 2025-05-07T20:33:11.8933920Z D: int, 2025-05-07T20:33:11.8934025Z scale_ub: Optional[float], 2025-05-07T20:33:11.8934124Z contiguous: bool, 2025-05-07T20:33:11.8934225Z compiled: bool, 2025-05-07T20:33:11.8934314Z ) -> None: 2025-05-07T20:33:11.8934491Z torch.manual_seed(2025) 2025-05-07T20:33:11.8934583Z 2025-05-07T20:33:11.8934759Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.8934851Z 2025-05-07T20:33:11.8934948Z x_sign = torch.sign(x) 2025-05-07T20:33:11.8935080Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:11.8935183Z x = x_sign * x_clamp 2025-05-07T20:33:11.8935270Z x0 = x[:, :D] 2025-05-07T20:33:11.8935357Z x1 = x[:, D:] 2025-05-07T20:33:11.8935444Z 2025-05-07T20:33:11.8935538Z if contiguous: 2025-05-07T20:33:11.8935638Z x0 = x0.contiguous() 2025-05-07T20:33:11.8935748Z x1 = x1.contiguous() 2025-05-07T20:33:11.8935826Z 2025-05-07T20:33:11.8935926Z if scale_ub is not None: 2025-05-07T20:33:11.8936051Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:11.8936197Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:11.8936277Z ) 2025-05-07T20:33:11.8936368Z else: 2025-05-07T20:33:11.8936472Z scale_ub_tensor = None 2025-05-07T20:33:11.8936559Z 2025-05-07T20:33:11.8936694Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:11.8936790Z op = silu_mul_quant 2025-05-07T20:33:11.8936887Z if compiled: 2025-05-07T20:33:11.8936995Z op = torch.compile(op) 2025-05-07T20:33:11.8937109Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8937197Z 2025-05-07T20:33:11.8937346Z > y_fp8, y_scale = fn() 2025-05-07T20:33:11.8937351Z 2025-05-07T20:33:11.8937453Z moe/activation_test.py:117: 2025-05-07T20:33:11.8937602Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8937709Z moe/activation_test.py:115: in fn 2025-05-07T20:33:11.8937865Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.8938265Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:11.8938369Z return fn(*args, **kwargs) 
2025-05-07T20:33:11.8938906Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:11.8939050Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:11.8939432Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:11.8939676Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:11.8940044Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:11.8940154Z kernel = self.compile( 2025-05-07T20:33:11.8940566Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:11.8940750Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:11.8940941Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.8940945Z 2025-05-07T20:33:11.8941163Z self = 2025-05-07T20:33:11.8941984Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:11.8942501Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f9856c31940>} 2025-05-07T20:33:11.8943303Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:11.8943512Z context = 2025-05-07T20:33:11.8943517Z 2025-05-07T20:33:11.8943695Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:11.8943986Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:11.8944098Z module_map=module_map) 2025-05-07T20:33:11.8944265Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:11.8944381Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:11.8944464Z E ^ 2025-05-07T20:33:11.8944850Z E ValueError("type fp8e4nv not supported in this architecture. 
Trying example: test_silu_mul_quant(
    self=<...>,
    T=128,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=True,
)
self = <...>
T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
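For reference while reading the listing above: judging from the call op(x0, x1, scale_ub_tensor) returning (y_fp8, y_scale), silu_mul_quant fuses a SiLU-and-multiply with FP8 quantization. A plain-PyTorch sketch of that contract, assuming row-wise E4M3 scaling with an optional upper bound on the pre-scale row maximum; the actual FBGEMM kernel's scaling scheme may differ:

from typing import Optional, Tuple

import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3


def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # The fused op, unfused: y = silu(x0) * x1, then quantize rows to FP8.
    y = torch.nn.functional.silu(x0.float()) * x1.float()
    row_max = y.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    y_scale = row_max / FP8_MAX
    y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)
    return y_fp8, y_scale.squeeze(1)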
Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
Trying example: test_silu_mul_quant(
    self=<...>,
    T=16384,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError
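From here on, memory pressure compounds across the Hypothesis loop: the failing line creeps upward from the kernel call to torch.clamp, torch.sign, and finally torch.randn itself as free memory shrinks from 140.44 MiB toward 26.44 MiB, which suggests earlier examples' allocations (or torch.compile caches holding them) are not being released. The allocator message names one mitigation; a sketch of it plus an explicit per-example cleanup, offered as assumptions about hardening the job, not as the repo's actual fix:

# 1) Before the process starts, so it applies when CUDA initializes:
#    PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python -m pytest moe/activation_test.py
# 2) Inside the test process, between examples:

import gc

import torch


def release_cuda_memory() -> None:
    gc.collect()              # drop dead Python references to tensors first
    torch.cuda.synchronize()  # ensure pending kernels have finished
    torch.cuda.empty_cache()  # return cached allocator blocks to the driver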
Trying example: test_silu_mul_quant(self=<...>, T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated.
moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated.
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated.
moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
>       x_sign = torch.sign(x)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated.
moe/activation_test.py:94: OutOfMemoryError

Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
Trying example: test_silu_mul_quant(self=<...>, T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
Trying example: test_silu_mul_quant(self=<...>, T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated.
moe/activation_test.py:92: OutOfMemoryError
Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
Trying example: test_silu_mul_quant(self=<...>, T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False)
>       x_sign = torch.sign(x)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated.
moe/activation_test.py:94: OutOfMemoryError
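Any one of these shrunk cases can be pinned as a deterministic regression test with Hypothesis's @example decorator, which replays the given inputs unconditionally before any generated examples. A self-contained sketch; the strategy mirrors the test above, while the function body here is illustrative:

from hypothesis import example, given, settings
from hypothesis import strategies as st


@given(T=st.sampled_from([1, 128, 2048, 4096, 16384]))
@example(T=2048)  # replayed first, before randomly generated examples
@settings(deadline=None)
def test_t_is_positive(T: int) -> None:
    assert T > 0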
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:11.9071119Z 2025-05-07T20:33:11.9071241Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:11.9071245Z 2025-05-07T20:33:11.9071350Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.9071624Z self=, 2025-05-07T20:33:11.9071706Z T=16384, 2025-05-07T20:33:11.9071778Z D=5120, 2025-05-07T20:33:11.9071871Z scale_ub=None, 2025-05-07T20:33:11.9071954Z contiguous=True, 2025-05-07T20:33:11.9072075Z compiled=False, 2025-05-07T20:33:11.9072146Z ) 2025-05-07T20:33:11.9072366Z self = 2025-05-07T20:33:11.9072545Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:11.9072550Z 2025-05-07T20:33:11.9072635Z @given( 2025-05-07T20:33:11.9072754Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.9072864Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.9072977Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.9073092Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.9073208Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.9073283Z ) 2025-05-07T20:33:11.9073533Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.9073626Z def test_silu_mul_quant( 2025-05-07T20:33:11.9073746Z self, 2025-05-07T20:33:11.9073820Z T: int, 2025-05-07T20:33:11.9073899Z D: int, 2025-05-07T20:33:11.9073994Z scale_ub: Optional[float], 2025-05-07T20:33:11.9074082Z contiguous: bool, 2025-05-07T20:33:11.9074168Z compiled: bool, 2025-05-07T20:33:11.9074244Z ) -> None: 2025-05-07T20:33:11.9074336Z torch.manual_seed(2025) 2025-05-07T20:33:11.9074406Z 2025-05-07T20:33:11.9074576Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.9076506Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:11.9076513Z 2025-05-07T20:33:11.9076629Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:11.9076633Z 2025-05-07T20:33:11.9076733Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.9076956Z self=, 2025-05-07T20:33:11.9077033Z T=4096, 2025-05-07T20:33:11.9077113Z D=5120, 2025-05-07T20:33:11.9077194Z scale_ub=None, 2025-05-07T20:33:11.9077279Z contiguous=True, 2025-05-07T20:33:11.9077368Z compiled=False, 2025-05-07T20:33:11.9077441Z ) 2025-05-07T20:33:11.9077663Z self = 2025-05-07T20:33:11.9077844Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:11.9077849Z 2025-05-07T20:33:11.9077923Z @given( 2025-05-07T20:33:11.9078041Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.9078136Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.9078249Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.9078370Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.9078478Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.9078552Z ) 2025-05-07T20:33:11.9078806Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.9078946Z def test_silu_mul_quant( 2025-05-07T20:33:11.9079022Z self, 2025-05-07T20:33:11.9079098Z T: int, 2025-05-07T20:33:11.9079177Z D: int, 2025-05-07T20:33:11.9079278Z scale_ub: Optional[float], 2025-05-07T20:33:11.9079364Z contiguous: bool, 2025-05-07T20:33:11.9079482Z compiled: bool, 2025-05-07T20:33:11.9079565Z ) -> None: 2025-05-07T20:33:11.9079657Z torch.manual_seed(2025) 2025-05-07T20:33:11.9079729Z 2025-05-07T20:33:11.9079906Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.9081855Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:11.9081864Z 2025-05-07T20:33:11.9081983Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:11.9081988Z 2025-05-07T20:33:11.9082094Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.9082334Z self=, 2025-05-07T20:33:11.9082414Z T=2048, 2025-05-07T20:33:11.9082531Z D=5120, 2025-05-07T20:33:11.9082623Z scale_ub=None, 2025-05-07T20:33:11.9082709Z contiguous=False, 2025-05-07T20:33:11.9082791Z compiled=False, 2025-05-07T20:33:11.9082871Z ) 2025-05-07T20:33:11.9083091Z self = 2025-05-07T20:33:11.9083267Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:11.9083272Z 2025-05-07T20:33:11.9083351Z @given( 2025-05-07T20:33:11.9083467Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.9083569Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.9083681Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.9083797Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.9083917Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.9083984Z ) 2025-05-07T20:33:11.9084237Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.9084333Z def test_silu_mul_quant( 2025-05-07T20:33:11.9084410Z self, 2025-05-07T20:33:11.9084483Z T: int, 2025-05-07T20:33:11.9084565Z D: int, 2025-05-07T20:33:11.9084661Z scale_ub: Optional[float], 2025-05-07T20:33:11.9084747Z contiguous: bool, 2025-05-07T20:33:11.9084835Z compiled: bool, 2025-05-07T20:33:11.9084912Z ) -> None: 2025-05-07T20:33:11.9085006Z torch.manual_seed(2025) 2025-05-07T20:33:11.9085075Z 2025-05-07T20:33:11.9085246Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.9087158Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:11.9087166Z 2025-05-07T20:33:11.9087281Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:11.9087286Z 2025-05-07T20:33:11.9087389Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.9087615Z self=, 2025-05-07T20:33:11.9087735Z T=4096, 2025-05-07T20:33:11.9087818Z D=7168, 2025-05-07T20:33:11.9087907Z scale_ub=None, 2025-05-07T20:33:11.9087993Z contiguous=True, 2025-05-07T20:33:11.9088080Z compiled=True, 2025-05-07T20:33:11.9088154Z ) 2025-05-07T20:33:11.9088411Z self = 2025-05-07T20:33:11.9088593Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:11.9088597Z 2025-05-07T20:33:11.9088678Z @given( 2025-05-07T20:33:11.9088797Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.9088943Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.9089059Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.9089185Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.9089301Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.9089373Z ) 2025-05-07T20:33:11.9089627Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.9089728Z def test_silu_mul_quant( 2025-05-07T20:33:11.9089808Z self, 2025-05-07T20:33:11.9089887Z T: int, 2025-05-07T20:33:11.9089963Z D: int, 2025-05-07T20:33:11.9090063Z scale_ub: Optional[float], 2025-05-07T20:33:11.9090155Z contiguous: bool, 2025-05-07T20:33:11.9090243Z compiled: bool, 2025-05-07T20:33:11.9090323Z ) -> None: 2025-05-07T20:33:11.9090460Z torch.manual_seed(2025) 2025-05-07T20:33:11.9090539Z 2025-05-07T20:33:11.9090726Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.9092636Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
The next six examples fail identically at moe/activation_test.py:92 in
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
each with GPU 0 reporting 26.44 MiB free of 22.07 GiB (21.73 GiB allocated by PyTorch, 19.12 MiB reserved but unallocated); only the requested allocation size varies:

Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB.

Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB.

Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=None, contiguous=False, compiled=True)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB.

Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=None, contiguous=True, compiled=False)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB.

Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=None, contiguous=True, compiled=False)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB.

Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB.
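Any of the failing parameter sets above can be pinned so it is always exercised first on the next run; a sketch using Hypothesis's @example decorator (illustrative only; the real decorators sit on ActivationTests.test_silu_mul_quant in moe/activation_test.py):

    from hypothesis import example, given, settings, strategies as st

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
    )
    @example(T=2048, D=5120)  # explicit examples run before sampled ones
    @settings(max_examples=10, deadline=None)
    def check_shapes(T: int, D: int) -> None:
        # Stand-in body; the real test allocates [T, 2 * D] bf16 tensors on CUDA.
        assert T * D > 0

    check_shapes()  # Hypothesis-wrapped test functions are directly callable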
Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
self =
T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f98569cccc0>}
module_map = {'triton.language.extra.libdevice': }
context = 

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
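Unlike the OOMs, this CompilationError is deterministic: Triton's fp8e4nv is the FP8 E4M3 format, which requires compute capability 8.9 or newer (Ada/Hopper), and the error text confirms this device only exposes fp8e4b15 and fp8e5. A sketch of a capability guard for such tests (the guard name and its placement are assumptions, not the suite's actual gating):

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # FP8 E4M3 ("fp8e4nv" in Triton) needs compute capability >= (8, 9).
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical usage on the test class:
    # @unittest.skipIf(not supports_fp8e4nv(), "FP8 E4M3 not supported on this GPU")
    # class ActivationTests(unittest.TestCase): ...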
Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=None, contiguous=False, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
(remainder of the Triton compile stack identical to the previous example)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError
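Free memory shrinks monotonically across examples (26.44 MiB free earlier in the run, 4.44 MiB here), so even these 20.00 MiB allocations now fail: tensors from previous examples are evidently still alive. A sketch of releasing cached CUDA blocks between examples (where to wire it in, e.g. TestCase.tearDown or a pytest fixture, is an assumption about the suite's structure):

    import gc

    import torch

    def release_cuda_memory() -> None:
        # Drop dangling Python references first so the caching allocator
        # can actually return freed segments to the driver.
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()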
Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=True, compiled=True)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError
=============================== warnings summary ===============================
../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108
../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108
../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108
  /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details.
    warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. See "
See " 2025-05-07T20:33:11.9181811Z 2025-05-07T20:33:11.9182028Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:33:11.9182200Z ================= 1 failed, 1 deselected, 3 warnings in 15.43s ================= 2025-05-07T20:33:13.6125233Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:33:13.6775278Z [EXEC] [ATTEMPT 0/2] Command attempt failed. 2025-05-07T20:33:13.6775814Z 2025-05-07T20:33:15.6793614Z [EXEC] [ATTEMPT 1/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:33:17.8497094Z ============================= test session starts ============================== 2025-05-07T20:33:17.8498354Z platform linux -- Python 3.12.2, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:33:17.8499427Z cachedir: .pytest_cache 2025-05-07T20:33:17.8500532Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:33:17.8501312Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:33:17.8501744Z plugins: hypothesis-6.131.14 2025-05-07T20:33:19.4957337Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:33:19.6046883Z collecting ... collected 2 items / 1 deselected / 1 selected 2025-05-07T20:33:19.6047321Z run-last-failure: rerun previous 1 failure 2025-05-07T20:33:19.6047545Z 2025-05-07T20:33:22.0555152Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.0555830Z self=, 2025-05-07T20:33:22.0556328Z T=1, 2025-05-07T20:33:22.0556537Z D=5120, 2025-05-07T20:33:22.0556754Z scale_ub=None, 2025-05-07T20:33:22.0556976Z contiguous=True, 2025-05-07T20:33:22.0557541Z compiled=True, 2025-05-07T20:33:22.0557756Z ) 2025-05-07T20:33:22.0558079Z self = 2025-05-07T20:33:22.0558584Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:22.0558858Z 2025-05-07T20:33:22.0558948Z @given( 2025-05-07T20:33:22.0559280Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.0559609Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.0559942Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.0560291Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.0560718Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.0561015Z ) 2025-05-07T20:33:22.0561373Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.0561824Z def test_silu_mul_quant( 2025-05-07T20:33:22.0562073Z self, 2025-05-07T20:33:22.0562274Z T: int, 2025-05-07T20:33:22.0562471Z D: int, 2025-05-07T20:33:22.0562694Z scale_ub: Optional[float], 2025-05-07T20:33:22.0562977Z contiguous: bool, 2025-05-07T20:33:22.0563216Z compiled: bool, 2025-05-07T20:33:22.0563456Z ) -> None: 2025-05-07T20:33:22.0563676Z torch.manual_seed(2025) 2025-05-07T20:33:22.0563922Z 2025-05-07T20:33:22.0564198Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.0564561Z 2025-05-07T20:33:22.0564843Z x_sign = torch.sign(x) 2025-05-07T20:33:22.0565135Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 
moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
self =
T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f09a057dc60>}
module_map = {'triton.language.extra.libdevice': }
context = 

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:22.8019373Z x1 = x[:, D:] 2025-05-07T20:33:22.8019589Z 2025-05-07T20:33:22.8019780Z if contiguous: 2025-05-07T20:33:22.8020037Z x0 = x0.contiguous() 2025-05-07T20:33:22.8020295Z x1 = x1.contiguous() 2025-05-07T20:33:22.8020547Z 2025-05-07T20:33:22.8020754Z if scale_ub is not None: 2025-05-07T20:33:22.8021032Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:22.8021384Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:22.8021710Z ) 2025-05-07T20:33:22.8021911Z else: 2025-05-07T20:33:22.8022132Z scale_ub_tensor = None 2025-05-07T20:33:22.8022398Z 2025-05-07T20:33:22.8022634Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.8022968Z op = silu_mul_quant 2025-05-07T20:33:22.8023236Z if compiled: 2025-05-07T20:33:22.8023485Z op = torch.compile(op) 2025-05-07T20:33:22.8023809Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.8024104Z 2025-05-07T20:33:22.8024304Z > y_fp8, y_scale = fn() 2025-05-07T20:33:22.8024473Z 2025-05-07T20:33:22.8024577Z moe/activation_test.py:117: 2025-05-07T20:33:22.8024887Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.8025236Z moe/activation_test.py:115: in fn 2025-05-07T20:33:22.8025898Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.8026637Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:22.8027365Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:22.8027930Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:22.8028648Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.8029347Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:22.8029910Z kernel = self.compile( 2025-05-07T20:33:22.8030471Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:22.8031159Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:22.8031566Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.8031896Z 2025-05-07T20:33:22.8032113Z self = 2025-05-07T20:33:22.8033294Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:22.8034870Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f09a03d4220>} 2025-05-07T20:33:22.8036352Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:22.8037449Z context = 2025-05-07T20:33:22.8037751Z 2025-05-07T20:33:22.8037932Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:22.8038472Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:22.8038966Z module_map=module_map) 2025-05-07T20:33:22.8039352Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:22.8039729Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:22.8039996Z E ^ 2025-05-07T20:33:22.8040541Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.8041021Z 2025-05-07T20:33:22.8041466Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:22.8042009Z 2025-05-07T20:33:22.8042126Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.8042550Z self=, 2025-05-07T20:33:22.8042981Z T=2048, 2025-05-07T20:33:22.8043181Z D=5120, 2025-05-07T20:33:22.8043381Z scale_ub=1200.0, 2025-05-07T20:33:22.8043617Z contiguous=True, 2025-05-07T20:33:22.8043851Z compiled=True, 2025-05-07T20:33:22.8044060Z ) 2025-05-07T20:33:22.8044398Z self = 2025-05-07T20:33:22.8044920Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:22.8045205Z 2025-05-07T20:33:22.8045287Z @given( 2025-05-07T20:33:22.8045532Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:22.8045865Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:22.8046187Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:22.8046525Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:22.8046870Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:22.8047175Z ) 2025-05-07T20:33:22.8047534Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:22.8048002Z def test_silu_mul_quant( 2025-05-07T20:33:22.8048258Z self, 2025-05-07T20:33:22.8048460Z T: int, 2025-05-07T20:33:22.8048671Z D: int, 2025-05-07T20:33:22.8048894Z scale_ub: Optional[float], 2025-05-07T20:33:22.8049169Z contiguous: bool, 2025-05-07T20:33:22.8049420Z compiled: bool, 2025-05-07T20:33:22.8049645Z ) -> None: 2025-05-07T20:33:22.8049858Z torch.manual_seed(2025) 2025-05-07T20:33:22.8050109Z 2025-05-07T20:33:22.8050391Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:22.8050745Z 2025-05-07T20:33:22.8050952Z x_sign = torch.sign(x) 2025-05-07T20:33:22.8051256Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:22.8051583Z x = x_sign * x_clamp 2025-05-07T20:33:22.8051828Z x0 = x[:, :D] 2025-05-07T20:33:22.8052055Z x1 = x[:, D:] 2025-05-07T20:33:22.8052269Z 2025-05-07T20:33:22.8052506Z if contiguous: 2025-05-07T20:33:22.8052745Z x0 = x0.contiguous() 2025-05-07T20:33:22.8053010Z x1 = x1.contiguous() 2025-05-07T20:33:22.8053247Z 2025-05-07T20:33:22.8053443Z if scale_ub is not None: 2025-05-07T20:33:22.8053717Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:22.8054094Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:22.8054509Z ) 2025-05-07T20:33:22.8054712Z else: 2025-05-07T20:33:22.8054920Z scale_ub_tensor = None 2025-05-07T20:33:22.8055222Z 2025-05-07T20:33:22.8055457Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.8055773Z op = silu_mul_quant 2025-05-07T20:33:22.8056029Z if compiled: 2025-05-07T20:33:22.8056280Z op = torch.compile(op) 2025-05-07T20:33:22.8056582Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:22.8056859Z 2025-05-07T20:33:22.8057056Z y_fp8, y_scale = fn() 2025-05-07T20:33:22.8057352Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:22.8057640Z 2025-05-07T20:33:22.8057884Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:22.8058233Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:22.8058534Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:22.8058862Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:22.8059284Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:22.8059597Z 2025-05-07T20:33:22.8059807Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:22.8060010Z 2025-05-07T20:33:22.8060116Z moe/activation_test.py:126: 2025-05-07T20:33:22.8060424Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.8060765Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:22.8061113Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:22.8061945Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:22.8062734Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:22.8063307Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:22.8064024Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:22.8064754Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:22.8065511Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:22.8066276Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:22.8066948Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:22.8067574Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:22.8068113Z fn() 2025-05-07T20:33:22.8068647Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:22.8069258Z self.fn.run( 2025-05-07T20:33:22.8069744Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:22.8070304Z kernel = self.compile( 2025-05-07T20:33:22.8070866Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:22.8071547Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:22.8071949Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:22.8072188Z 2025-05-07T20:33:22.8072397Z self = 2025-05-07T20:33:22.8073589Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:22.8075060Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f09a03d56c0>} 2025-05-07T20:33:22.8086662Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:22.8087847Z context = 2025-05-07T20:33:22.8088372Z 2025-05-07T20:33:22.8088553Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:22.8089098Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:22.8089594Z module_map=module_map) 2025-05-07T20:33:22.8089978Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:22.8090359Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:22.8090630Z E ^ 2025-05-07T20:33:22.8091122Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:22.8091601Z 2025-05-07T20:33:22.8092150Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:22.8092707Z 2025-05-07T20:33:22.8092821Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:22.8093267Z self=, 2025-05-07T20:33:22.8093703Z T=16384, 2025-05-07T20:33:22.8093913Z D=7168, 2025-05-07T20:33:22.8094114Z scale_ub=1200.0, 2025-05-07T20:33:22.8094448Z contiguous=False, 2025-05-07T20:33:22.8094694Z compiled=False, 2025-05-07T20:33:22.8094908Z ) 2025-05-07T20:33:23.5601066Z self = 2025-05-07T20:33:23.5601626Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:23.5601932Z 2025-05-07T20:33:23.5602036Z @given( 2025-05-07T20:33:23.5602310Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:23.5602650Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:23.5602999Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:23.5603376Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:23.5603716Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:23.5604000Z ) 2025-05-07T20:33:23.5604362Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:23.5604826Z def test_silu_mul_quant( 2025-05-07T20:33:23.5605074Z self, 2025-05-07T20:33:23.5605268Z T: int, 2025-05-07T20:33:23.5605468Z D: int, 2025-05-07T20:33:23.5605690Z scale_ub: Optional[float], 2025-05-07T20:33:23.5605966Z contiguous: bool, 2025-05-07T20:33:23.5606209Z compiled: bool, 2025-05-07T20:33:23.5606438Z ) -> None: 2025-05-07T20:33:23.5606654Z torch.manual_seed(2025) 2025-05-07T20:33:23.5606904Z 2025-05-07T20:33:23.5607190Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:23.5607576Z 2025-05-07T20:33:23.5607780Z x_sign = torch.sign(x) 2025-05-07T20:33:23.5608072Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:23.5608393Z x = x_sign * x_clamp 2025-05-07T20:33:23.5608638Z x0 = x[:, :D] 2025-05-07T20:33:23.5608847Z x1 = x[:, D:] 2025-05-07T20:33:23.5609055Z 2025-05-07T20:33:23.5609242Z if contiguous: 2025-05-07T20:33:23.5609473Z x0 = x0.contiguous() 2025-05-07T20:33:23.5609740Z x1 = x1.contiguous() 2025-05-07T20:33:23.5610143Z 2025-05-07T20:33:23.5610329Z if scale_ub is not None: 2025-05-07T20:33:23.5610611Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:23.5610955Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:23.5611267Z ) 2025-05-07T20:33:23.5611553Z else: 2025-05-07T20:33:23.5611777Z scale_ub_tensor = None 2025-05-07T20:33:23.5612030Z 2025-05-07T20:33:23.5612277Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:23.5612645Z op = silu_mul_quant 2025-05-07T20:33:23.5612983Z if compiled: 2025-05-07T20:33:23.5613231Z op = torch.compile(op) 2025-05-07T20:33:23.5613536Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:23.5613827Z 2025-05-07T20:33:23.5614016Z > y_fp8, y_scale = fn() 2025-05-07T20:33:23.5614191Z 2025-05-07T20:33:23.5614292Z moe/activation_test.py:117: 2025-05-07T20:33:23.5614710Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:23.5615045Z moe/activation_test.py:115: in fn 2025-05-07T20:33:23.5615338Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:23.5616072Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:23.5616804Z _fbgemm_silu_mul_quant[grid](
[... Triton compile stack: jit.py:330 (<lambda>) -> jit.py:623 (run) -> compiler.py:273 (compile) -> ASTSource.make_ir, with the same CUDAOptions as above ...]
2025-05-07T20:33:23.5628771Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:23.5629145Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:33:23.5629424Z E ^
2025-05-07T20:33:23.5629918Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:23.5630832Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
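All of these failures share one root cause: Triton's fp8e4nv is the FP8 E4M3 format (torch.float8_e4m3fn), and Triton's NVIDIA backend only accepts it on GPUs with compute capability (8, 9) or newer (Ada/Hopper). On this runner's GPU the backend offers only fp8e4b15 and fp8e5, as the ValueError says, so every kernel that touches the FP8 row-quantized dtype dies at compile time regardless of which example parameters Hypothesis draws. A minimal guard sketch, assuming the test keeps its current shape (the helper name and skipUnless wiring below are illustrative, not part of activation_test.py):

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv (torch.float8_e4m3fn) needs SM 8.9+ on NVIDIA GPUs.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Illustrative usage on the test above:
    # @unittest.skipUnless(supports_fp8e4nv(), "FP8 E4M3 requires SM 8.9+")
    # def test_silu_mul_quant(...): ...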
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:23.5630393Z 2025-05-07T20:33:23.5630832Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:23.5631506Z 2025-05-07T20:33:23.5631612Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:23.5632039Z self=, 2025-05-07T20:33:23.5632453Z T=1, 2025-05-07T20:33:23.5632648Z D=7168, 2025-05-07T20:33:23.5632847Z scale_ub=None, 2025-05-07T20:33:23.5633128Z contiguous=True, 2025-05-07T20:33:23.5633360Z compiled=True, 2025-05-07T20:33:23.5633574Z ) 2025-05-07T20:33:23.5633908Z self = 2025-05-07T20:33:23.5634406Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:23.5634742Z 2025-05-07T20:33:23.5634822Z @given( 2025-05-07T20:33:23.5635060Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:23.5635379Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:23.5635699Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:23.5636044Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:23.5636385Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:23.5636691Z ) 2025-05-07T20:33:23.5637053Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:23.5637522Z def test_silu_mul_quant( 2025-05-07T20:33:23.5637767Z self, 2025-05-07T20:33:23.5637979Z T: int, 2025-05-07T20:33:23.5638192Z D: int, 2025-05-07T20:33:23.5638410Z scale_ub: Optional[float], 2025-05-07T20:33:23.5638751Z contiguous: bool, 2025-05-07T20:33:23.5638993Z compiled: bool, 2025-05-07T20:33:23.5639213Z ) -> None: 2025-05-07T20:33:23.5639431Z torch.manual_seed(2025) 2025-05-07T20:33:23.5639676Z 2025-05-07T20:33:23.5639948Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:23.5640306Z 2025-05-07T20:33:23.5640503Z x_sign = torch.sign(x) 2025-05-07T20:33:23.5640793Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:23.5641114Z x = x_sign * x_clamp 2025-05-07T20:33:23.5641357Z x0 = x[:, :D] 2025-05-07T20:33:23.5641568Z x1 = x[:, D:] 2025-05-07T20:33:23.5641784Z 2025-05-07T20:33:23.5641973Z if contiguous: 2025-05-07T20:33:23.5642201Z x0 = x0.contiguous() 2025-05-07T20:33:23.5642516Z x1 = x1.contiguous() 2025-05-07T20:33:23.5642764Z 2025-05-07T20:33:23.5642959Z if scale_ub is not None: 2025-05-07T20:33:23.5643230Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:23.5643573Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:23.5643889Z ) 2025-05-07T20:33:23.5644079Z else: 2025-05-07T20:33:23.5644292Z scale_ub_tensor = None 2025-05-07T20:33:23.5644553Z 2025-05-07T20:33:23.5644780Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:23.5645100Z op = silu_mul_quant 2025-05-07T20:33:23.5645353Z if compiled: 2025-05-07T20:33:23.5645599Z op = torch.compile(op) 2025-05-07T20:33:23.5645900Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:23.5646180Z 2025-05-07T20:33:23.5646373Z y_fp8, y_scale = fn() 2025-05-07T20:33:23.5646656Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:23.5646950Z 2025-05-07T20:33:23.5647185Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:23.5647524Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:23.5647822Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:23.5648142Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:23.5648503Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:23.5648820Z 2025-05-07T20:33:23.5649030Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:23.5649230Z 2025-05-07T20:33:23.5649333Z moe/activation_test.py:126: 2025-05-07T20:33:23.5649640Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:23.5650072Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:23.5650402Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:23.5651267Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:23.5652071Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:23.5652705Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:23.5653419Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:23.5654188Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:23.5655034Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:23.5655809Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:23.5656483Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:23.5657126Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:23.5657682Z fn() 2025-05-07T20:33:23.5658216Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:23.5658845Z self.fn.run( 2025-05-07T20:33:23.5659385Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:23.5659957Z kernel = self.compile( 2025-05-07T20:33:23.5660519Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:23.5661209Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:23.5661624Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:23.5661867Z 2025-05-07T20:33:23.5662090Z self = 2025-05-07T20:33:23.5663219Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:23.5664657Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f099b2c0cc0>} 2025-05-07T20:33:23.5666078Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:23.5667168Z context = 2025-05-07T20:33:23.5667471Z 2025-05-07T20:33:23.5667640Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:23.5668184Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:23.5668685Z module_map=module_map) 2025-05-07T20:33:23.5669067Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:23.5669439Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:23.5669710Z E ^ 2025-05-07T20:33:23.5670192Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:23.5670662Z 2025-05-07T20:33:23.5671102Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:23.5671640Z 2025-05-07T20:33:23.5671748Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:23.5672164Z self=, 2025-05-07T20:33:23.5672677Z T=4096, 2025-05-07T20:33:23.5672867Z D=5120, 2025-05-07T20:33:23.5673060Z scale_ub=None, 2025-05-07T20:33:23.5673279Z contiguous=False, 2025-05-07T20:33:23.5673507Z compiled=False, 2025-05-07T20:33:23.5673703Z ) 2025-05-07T20:33:24.3802869Z self = 2025-05-07T20:33:24.3804002Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:24.3804591Z 2025-05-07T20:33:24.3804762Z @given( 2025-05-07T20:33:24.3805236Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:24.3805982Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:24.3806603Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:24.3807271Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:24.3807933Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:24.3808494Z ) 2025-05-07T20:33:24.3809195Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:24.3810105Z def test_silu_mul_quant( 2025-05-07T20:33:24.3810572Z self, 2025-05-07T20:33:24.3810955Z T: int, 2025-05-07T20:33:24.3811344Z D: int, 2025-05-07T20:33:24.3811764Z scale_ub: Optional[float], 2025-05-07T20:33:24.3812310Z contiguous: bool, 2025-05-07T20:33:24.3812603Z compiled: bool, 2025-05-07T20:33:24.3812850Z ) -> None: 2025-05-07T20:33:24.3813139Z torch.manual_seed(2025) 2025-05-07T20:33:24.3813388Z 2025-05-07T20:33:24.3813663Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:24.3814018Z 2025-05-07T20:33:24.3814210Z x_sign = torch.sign(x) 2025-05-07T20:33:24.3814607Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:24.3814921Z x = x_sign * x_clamp 2025-05-07T20:33:24.3815161Z x0 = x[:, :D] 2025-05-07T20:33:24.3815372Z x1 = x[:, D:] 2025-05-07T20:33:24.3815578Z 2025-05-07T20:33:24.3815763Z if contiguous: 2025-05-07T20:33:24.3815994Z x0 = x0.contiguous() 2025-05-07T20:33:24.3816250Z x1 = x1.contiguous() 2025-05-07T20:33:24.3816495Z 2025-05-07T20:33:24.3816687Z if scale_ub is not None: 2025-05-07T20:33:24.3816961Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:24.3817300Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:24.3817617Z ) 2025-05-07T20:33:24.3817807Z else: 2025-05-07T20:33:24.3818025Z scale_ub_tensor = None 2025-05-07T20:33:24.3818283Z 2025-05-07T20:33:24.3818512Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:24.3818835Z op = silu_mul_quant 2025-05-07T20:33:24.3819088Z if compiled: 2025-05-07T20:33:24.3819331Z op = torch.compile(op) 2025-05-07T20:33:24.3819634Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:24.3819915Z 2025-05-07T20:33:24.3820117Z > y_fp8, y_scale = fn() 2025-05-07T20:33:24.3820284Z 2025-05-07T20:33:24.3820383Z moe/activation_test.py:117: 2025-05-07T20:33:24.3820688Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:24.3821028Z moe/activation_test.py:115: in fn 2025-05-07T20:33:24.3821311Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:24.3822039Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:24.3822773Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:24.3823333Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:24.3824038Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:24.3824730Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:24.3826010Z kernel = self.compile( 2025-05-07T20:33:24.3826643Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:24.3827427Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:24.3827957Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:24.3828225Z 2025-05-07T20:33:24.3828466Z self = 2025-05-07T20:33:24.3829784Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:24.3831548Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f09a03b7240>} 2025-05-07T20:33:24.3833211Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:24.3834467Z context = 2025-05-07T20:33:24.3834810Z 2025-05-07T20:33:24.3835001Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:24.3835664Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:24.3836154Z module_map=module_map) 2025-05-07T20:33:24.3836521Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:24.3836875Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:24.3837141Z E ^ 2025-05-07T20:33:24.3837617Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:24.3838091Z 2025-05-07T20:33:24.3838532Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:24.3839070Z 2025-05-07T20:33:24.3839171Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:24.3839594Z self=, 2025-05-07T20:33:24.3840012Z T=4096, 2025-05-07T20:33:24.3840193Z D=7168, 2025-05-07T20:33:24.3840386Z scale_ub=None, 2025-05-07T20:33:24.3840606Z contiguous=False, 2025-05-07T20:33:24.3840826Z compiled=False, 2025-05-07T20:33:24.3841034Z ) 2025-05-07T20:33:24.3841358Z self = 2025-05-07T20:33:24.3841868Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:24.3842150Z 2025-05-07T20:33:24.3842227Z @given( 2025-05-07T20:33:24.3842471Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:24.3842821Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:24.3843129Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:24.3843463Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:24.3843798Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:24.3844084Z ) 2025-05-07T20:33:24.3844441Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:24.3844896Z def test_silu_mul_quant( 2025-05-07T20:33:24.3845134Z self, 2025-05-07T20:33:24.3845334Z T: int, 2025-05-07T20:33:24.3845528Z D: int, 2025-05-07T20:33:24.3845747Z scale_ub: Optional[float], 2025-05-07T20:33:24.3846017Z contiguous: bool, 2025-05-07T20:33:24.3846258Z compiled: bool, 2025-05-07T20:33:24.3846480Z ) -> None: 2025-05-07T20:33:24.3846693Z torch.manual_seed(2025) 2025-05-07T20:33:24.3846939Z 2025-05-07T20:33:24.3847212Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:24.3847630Z 2025-05-07T20:33:24.3847824Z x_sign = torch.sign(x) 2025-05-07T20:33:24.3848113Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:24.3848419Z x = x_sign * x_clamp 2025-05-07T20:33:24.3848655Z x0 = x[:, :D] 2025-05-07T20:33:24.3848868Z x1 = x[:, D:] 2025-05-07T20:33:24.3849112Z 2025-05-07T20:33:24.3849302Z if contiguous: 2025-05-07T20:33:24.3849534Z x0 = x0.contiguous() 2025-05-07T20:33:24.3849790Z x1 = x1.contiguous() 2025-05-07T20:33:24.3850032Z 2025-05-07T20:33:24.3850263Z if scale_ub is not None: 2025-05-07T20:33:24.3850532Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:24.3850869Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:24.3851184Z ) 2025-05-07T20:33:24.3851375Z else: 2025-05-07T20:33:24.3851578Z scale_ub_tensor = None 2025-05-07T20:33:24.3851831Z 2025-05-07T20:33:24.3852068Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:24.3852388Z op = silu_mul_quant 2025-05-07T20:33:24.3852664Z if compiled: 2025-05-07T20:33:24.3852936Z op = torch.compile(op) 2025-05-07T20:33:24.3853231Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:24.3853508Z 2025-05-07T20:33:24.3853706Z > y_fp8, y_scale = fn() 2025-05-07T20:33:24.3853870Z 2025-05-07T20:33:24.3853967Z moe/activation_test.py:117: 2025-05-07T20:33:24.3854341Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:24.3854751Z moe/activation_test.py:115: in fn 2025-05-07T20:33:24.3855044Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:24.3855761Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:24.3856486Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:24.3857048Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:24.3857762Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:24.3858457Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:24.3859019Z kernel = self.compile( 2025-05-07T20:33:24.3859582Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:24.3860264Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:24.3860678Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:24.3860912Z 2025-05-07T20:33:24.3861129Z self = 2025-05-07T20:33:24.3862254Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:24.3863669Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f099a879ee0>} 2025-05-07T20:33:24.3865074Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:24.3866158Z context = 2025-05-07T20:33:24.3866457Z 2025-05-07T20:33:24.3866632Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:24.3867162Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:24.3867646Z module_map=module_map) 2025-05-07T20:33:24.3868061Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:24.3868420Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:24.3868679Z E ^ 2025-05-07T20:33:24.3869151Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:24.3869661Z 2025-05-07T20:33:24.3870106Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:24.3870647Z 2025-05-07T20:33:24.3870755Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:24.3871217Z self=, 2025-05-07T20:33:24.3871636Z T=128, 2025-05-07T20:33:24.3871828Z D=7168, 2025-05-07T20:33:24.3872011Z scale_ub=None, 2025-05-07T20:33:24.3872223Z contiguous=False, 2025-05-07T20:33:24.3872448Z compiled=True, 2025-05-07T20:33:24.3872669Z ) 2025-05-07T20:33:24.4430973Z self = 2025-05-07T20:33:24.4431509Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:24.4431836Z 2025-05-07T20:33:24.4431934Z @given( 2025-05-07T20:33:24.4432166Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:24.4432499Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:24.4432820Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:24.4433167Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:24.4433592Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:24.4441196Z ) 2025-05-07T20:33:24.4441591Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:24.4442056Z def test_silu_mul_quant( 2025-05-07T20:33:24.4442311Z self, 2025-05-07T20:33:24.4442515Z T: int, 2025-05-07T20:33:24.4442714Z D: int, 2025-05-07T20:33:24.4442948Z scale_ub: Optional[float], 2025-05-07T20:33:24.4443242Z contiguous: bool, 2025-05-07T20:33:24.4443485Z compiled: bool, 2025-05-07T20:33:24.4443719Z ) -> None: 2025-05-07T20:33:24.4443949Z torch.manual_seed(2025) 2025-05-07T20:33:24.4444201Z 2025-05-07T20:33:24.4444488Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:24.4444855Z 2025-05-07T20:33:24.4445053Z x_sign = torch.sign(x) 2025-05-07T20:33:24.4445357Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:24.4445688Z x = x_sign * x_clamp 2025-05-07T20:33:24.4445931Z x0 = x[:, :D] 2025-05-07T20:33:24.4446152Z x1 = x[:, D:] 2025-05-07T20:33:24.4446366Z 2025-05-07T20:33:24.4446554Z if contiguous: 2025-05-07T20:33:24.4446798Z x0 = x0.contiguous() 2025-05-07T20:33:24.4447069Z x1 = x1.contiguous() 2025-05-07T20:33:24.4447325Z 2025-05-07T20:33:24.4447520Z if scale_ub is not None: 2025-05-07T20:33:24.4447807Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:24.4448162Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:24.4448481Z ) 2025-05-07T20:33:24.4448687Z else: 2025-05-07T20:33:24.4448908Z scale_ub_tensor = None 2025-05-07T20:33:24.4449163Z 2025-05-07T20:33:24.4449405Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:24.4449733Z op = silu_mul_quant 2025-05-07T20:33:24.4449985Z if compiled: 2025-05-07T20:33:24.4450243Z op = torch.compile(op) 2025-05-07T20:33:24.4450551Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:24.4450834Z 2025-05-07T20:33:24.4451035Z y_fp8, y_scale = fn() 2025-05-07T20:33:24.4451332Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:24.4451629Z 2025-05-07T20:33:24.4451871Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:24.4452221Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:24.4452637Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:24.4452990Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:24.4453375Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:24.4453702Z 2025-05-07T20:33:24.4453969Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:24.4454181Z 2025-05-07T20:33:24.4454287Z moe/activation_test.py:126: 2025-05-07T20:33:24.4454720Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:24.4455060Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:24.4455470Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:24.4456300Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:24.4457100Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:24.4457667Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:24.4458390Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:24.4459125Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:24.4459890Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:24.4460699Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:24.4461471Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:24.4462192Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:24.4462808Z fn() 2025-05-07T20:33:24.4463414Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:24.4464117Z self.fn.run( 2025-05-07T20:33:24.4464667Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:24.4465290Z kernel = self.compile( 2025-05-07T20:33:24.4465931Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:24.4466716Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:24.4467175Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:24.4467450Z 2025-05-07T20:33:24.4467687Z self = 2025-05-07T20:33:24.4469031Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:24.4470740Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f099a7f9120>} 2025-05-07T20:33:24.4472408Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:24.4473668Z context = 2025-05-07T20:33:24.4474011Z 2025-05-07T20:33:24.4474198Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:24.4474818Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:24.4475371Z module_map=module_map) 2025-05-07T20:33:24.4475784Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:24.4476189Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:24.4476542Z E ^ 2025-05-07T20:33:24.4477085Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:24.4477640Z 2025-05-07T20:33:24.4478146Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:24.4478819Z 2025-05-07T20:33:24.4478934Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:24.4479418Z self=, 2025-05-07T20:33:24.4479888Z T=128, 2025-05-07T20:33:24.4480133Z D=7168, 2025-05-07T20:33:24.4480345Z scale_ub=None, 2025-05-07T20:33:24.4480580Z contiguous=False, 2025-05-07T20:33:24.4480830Z compiled=False, 2025-05-07T20:33:24.4481056Z ) 2025-05-07T20:33:24.6467609Z self = 2025-05-07T20:33:24.6468163Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:24.6468470Z 2025-05-07T20:33:24.6468551Z @given( 2025-05-07T20:33:24.6468823Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:24.6469148Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:24.6469451Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:24.6469790Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:24.6470132Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:24.6470417Z ) 2025-05-07T20:33:24.6470881Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:24.6471349Z def test_silu_mul_quant( 2025-05-07T20:33:24.6471595Z self, 2025-05-07T20:33:24.6471806Z T: int, 2025-05-07T20:33:24.6472015Z D: int, 2025-05-07T20:33:24.6472234Z scale_ub: Optional[float], 2025-05-07T20:33:24.6472506Z contiguous: bool, 2025-05-07T20:33:24.6472749Z compiled: bool, 2025-05-07T20:33:24.6472977Z ) -> None: 2025-05-07T20:33:24.6473190Z torch.manual_seed(2025) 2025-05-07T20:33:24.6473429Z 2025-05-07T20:33:24.6473711Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:24.6474058Z 2025-05-07T20:33:24.6474255Z x_sign = torch.sign(x) 2025-05-07T20:33:24.6474550Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:24.6474858Z x = x_sign * x_clamp 2025-05-07T20:33:24.6475099Z x0 = x[:, :D] 2025-05-07T20:33:24.6475308Z x1 = x[:, D:] 2025-05-07T20:33:24.6475509Z 2025-05-07T20:33:24.6475692Z if contiguous: 2025-05-07T20:33:24.6475929Z x0 = x0.contiguous() 2025-05-07T20:33:24.6476181Z x1 = x1.contiguous() 2025-05-07T20:33:24.6476431Z 2025-05-07T20:33:24.6476625Z if scale_ub is not None: 2025-05-07T20:33:24.6476902Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:24.6477242Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:24.6477552Z ) 2025-05-07T20:33:24.6477744Z else: 2025-05-07T20:33:24.6477948Z scale_ub_tensor = None 2025-05-07T20:33:24.6478203Z 2025-05-07T20:33:24.6478433Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:24.6478746Z op = silu_mul_quant 2025-05-07T20:33:24.6479004Z if compiled: 2025-05-07T20:33:24.6479250Z op = torch.compile(op) 2025-05-07T20:33:24.6479545Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:24.6479821Z 2025-05-07T20:33:24.6480020Z > y_fp8, y_scale = fn() 2025-05-07T20:33:24.6480186Z 2025-05-07T20:33:24.6480284Z moe/activation_test.py:117: 2025-05-07T20:33:24.6480580Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:24.6480915Z moe/activation_test.py:115: in fn 2025-05-07T20:33:24.6481199Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:24.6481913Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:24.6482712Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:24.6483268Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:24.6484034Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:24.6484759Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:24.6485318Z kernel = self.compile( 2025-05-07T20:33:24.6485942Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:24.6486629Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:24.6487033Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:24.6487273Z 2025-05-07T20:33:24.6487485Z self = 2025-05-07T20:33:24.6488610Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:24.6490038Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f099a4b8b80>} 2025-05-07T20:33:24.6491476Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:24.6492567Z context = 2025-05-07T20:33:24.6492875Z 2025-05-07T20:33:24.6493047Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:24.6493589Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:24.6494074Z module_map=module_map) 2025-05-07T20:33:24.6494517Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:24.6494885Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:24.6495155Z E ^ 2025-05-07T20:33:24.6495633Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:24.6496108Z 2025-05-07T20:33:24.6496546Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:24.6497094Z 2025-05-07T20:33:24.6497209Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:24.6497635Z self=, 2025-05-07T20:33:24.6498050Z T=4096, 2025-05-07T20:33:24.6498244Z D=5120, 2025-05-07T20:33:24.6498441Z scale_ub=1200.0, 2025-05-07T20:33:24.6498664Z contiguous=True, 2025-05-07T20:33:24.6498895Z compiled=False, 2025-05-07T20:33:24.6499113Z ) 2025-05-07T20:33:24.6499438Z self = 2025-05-07T20:33:24.6499958Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:24.6500244Z 2025-05-07T20:33:24.6500329Z @given( 2025-05-07T20:33:24.6500556Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:24.6500885Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:24.6501199Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:24.6501533Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:24.6501875Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:24.6502171Z ) 2025-05-07T20:33:24.6502534Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:24.6502996Z def test_silu_mul_quant( 2025-05-07T20:33:24.6503297Z self, 2025-05-07T20:33:24.6503493Z T: int, 2025-05-07T20:33:24.6503696Z D: int, 2025-05-07T20:33:24.6503912Z scale_ub: Optional[float], 2025-05-07T20:33:24.6504180Z contiguous: bool, 2025-05-07T20:33:24.6504426Z compiled: bool, 2025-05-07T20:33:24.6504638Z ) -> None: 2025-05-07T20:33:24.6504890Z torch.manual_seed(2025) 2025-05-07T20:33:24.6505135Z 2025-05-07T20:33:24.6505408Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:24.6505768Z 2025-05-07T20:33:24.6505968Z x_sign = torch.sign(x) 2025-05-07T20:33:24.6506325Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:24.6506647Z x = x_sign * x_clamp 2025-05-07T20:33:24.6506887Z x0 = x[:, :D] 2025-05-07T20:33:24.6507098Z x1 = x[:, D:] 2025-05-07T20:33:24.6507309Z 2025-05-07T20:33:24.6507502Z if contiguous: 2025-05-07T20:33:24.6507732Z x0 = x0.contiguous() 2025-05-07T20:33:24.6508001Z x1 = x1.contiguous() 2025-05-07T20:33:24.6508249Z 2025-05-07T20:33:24.6508438Z if scale_ub is not None: 2025-05-07T20:33:24.6508716Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:24.6509059Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:24.6509388Z ) 2025-05-07T20:33:24.6509584Z else: 2025-05-07T20:33:24.6509792Z scale_ub_tensor = None 2025-05-07T20:33:24.6510046Z 2025-05-07T20:33:24.6510313Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:24.6510640Z op = silu_mul_quant 2025-05-07T20:33:24.6510891Z if compiled: 2025-05-07T20:33:24.6511130Z op = torch.compile(op) 2025-05-07T20:33:24.6511424Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:24.6511700Z 2025-05-07T20:33:24.6511885Z > y_fp8, y_scale = fn() 2025-05-07T20:33:24.6512055Z 2025-05-07T20:33:24.6512151Z moe/activation_test.py:117: 2025-05-07T20:33:24.6512453Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:24.6512780Z moe/activation_test.py:115: in fn 2025-05-07T20:33:24.6513060Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:24.6513779Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:24.6514500Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:24.6515051Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:24.6515762Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:24.6516452Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:24.6517004Z kernel = self.compile( 2025-05-07T20:33:24.6517552Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:24.6518235Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:24.6518633Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:24.6518867Z 2025-05-07T20:33:24.6519075Z self = 2025-05-07T20:33:24.6520191Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:24.6521607Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f099a4b9b20>} 2025-05-07T20:33:24.6523007Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:24.6524137Z context = 2025-05-07T20:33:24.6524438Z 2025-05-07T20:33:24.6524610Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:24.6525188Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:24.6525847Z module_map=module_map) 2025-05-07T20:33:24.6526227Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:24.6526662Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:24.6526929Z E ^ 2025-05-07T20:33:24.6527406Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:24.6527878Z 2025-05-07T20:33:24.6528314Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:24.6528864Z 2025-05-07T20:33:24.6528969Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:24.6529393Z self=, 2025-05-07T20:33:24.6529808Z T=1, 2025-05-07T20:33:24.6529991Z D=5120, 2025-05-07T20:33:24.6530189Z scale_ub=None, 2025-05-07T20:33:24.6530410Z contiguous=True, 2025-05-07T20:33:24.6530634Z compiled=True, 2025-05-07T20:33:24.6530840Z ) 2025-05-07T20:33:25.0445319Z self = 2025-05-07T20:33:25.0446710Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:25.0447258Z 2025-05-07T20:33:25.0447439Z @given( 2025-05-07T20:33:25.0447912Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:25.0448568Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:25.0449201Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:25.0449874Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:25.0450560Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:25.0451142Z ) 2025-05-07T20:33:25.0451854Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:25.0452598Z def test_silu_mul_quant( 2025-05-07T20:33:25.0452882Z self, 2025-05-07T20:33:25.0453090Z T: int, 2025-05-07T20:33:25.0453288Z D: int, 2025-05-07T20:33:25.0453515Z scale_ub: Optional[float], 2025-05-07T20:33:25.0453805Z contiguous: bool, 2025-05-07T20:33:25.0454056Z compiled: bool, 2025-05-07T20:33:25.0454283Z ) -> None: 2025-05-07T20:33:25.0454625Z torch.manual_seed(2025) 2025-05-07T20:33:25.0454864Z 2025-05-07T20:33:25.0455142Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:25.0455493Z 2025-05-07T20:33:25.0455684Z x_sign = torch.sign(x) 2025-05-07T20:33:25.0455978Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:25.0456297Z x = x_sign * x_clamp 2025-05-07T20:33:25.0456536Z x0 = x[:, :D] 2025-05-07T20:33:25.0456759Z x1 = x[:, D:] 2025-05-07T20:33:25.0456969Z 2025-05-07T20:33:25.0457157Z if contiguous: 2025-05-07T20:33:25.0457399Z x0 = x0.contiguous() 2025-05-07T20:33:25.0457669Z x1 = x1.contiguous() 2025-05-07T20:33:25.0457924Z 2025-05-07T20:33:25.0458118Z if scale_ub is not None: 2025-05-07T20:33:25.0458407Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:25.0458753Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:25.0459073Z ) 2025-05-07T20:33:25.0459269Z else: 2025-05-07T20:33:25.0459490Z scale_ub_tensor = None 2025-05-07T20:33:25.0459748Z 2025-05-07T20:33:25.0459985Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:25.0460301Z op = silu_mul_quant 2025-05-07T20:33:25.0460547Z if compiled: 2025-05-07T20:33:25.0460886Z op = torch.compile(op) 2025-05-07T20:33:25.0461188Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:25.0461464Z 2025-05-07T20:33:25.0461660Z y_fp8, y_scale = fn() 2025-05-07T20:33:25.0461951Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:25.0462324Z 2025-05-07T20:33:25.0462560Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:25.0462902Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:25.0463202Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:25.0463580Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:25.0463947Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:25.0464265Z 2025-05-07T20:33:25.0464458Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:25.0464664Z 2025-05-07T20:33:25.0464763Z moe/activation_test.py:126: 2025-05-07T20:33:25.0465061Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:25.0465403Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:25.0465729Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:25.0466559Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:25.0467354Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:25.0467986Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:25.0468716Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:25.0469446Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:25.0470210Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:25.0470976Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:25.0471658Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:25.0472292Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:25.0472836Z fn() 2025-05-07T20:33:25.0473358Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:25.0473984Z self.fn.run( 2025-05-07T20:33:25.0474473Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:25.0475024Z kernel = self.compile( 2025-05-07T20:33:25.0475586Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:25.0476270Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:25.0476671Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:25.0476917Z 2025-05-07T20:33:25.0477128Z self = 2025-05-07T20:33:25.0478261Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:25.0479699Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f099a4baca0>} 2025-05-07T20:33:25.0481113Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:25.0482192Z context = 2025-05-07T20:33:25.0482584Z 2025-05-07T20:33:25.0482778Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:25.0483320Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:25.0483928Z module_map=module_map) 2025-05-07T20:33:25.0484302Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:25.0484676Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:25.0484956Z E ^ 2025-05-07T20:33:25.0485430Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:25.0485950Z 2025-05-07T20:33:25.0486391Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:25.0486941Z 2025-05-07T20:33:25.0487046Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:25.0487474Z self=, 2025-05-07T20:33:25.0487899Z T=2048, 2025-05-07T20:33:25.0488100Z D=5120, 2025-05-07T20:33:25.0488307Z scale_ub=None, 2025-05-07T20:33:25.0488514Z contiguous=True, 2025-05-07T20:33:25.0488739Z compiled=True, 2025-05-07T20:33:25.0488946Z ) 2025-05-07T20:33:25.4246051Z self = 2025-05-07T20:33:25.4246630Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:25.4247131Z 2025-05-07T20:33:25.4247230Z @given( 2025-05-07T20:33:25.4247471Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:25.4247801Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:25.4248119Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:25.4248468Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:25.4248807Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:25.4249105Z ) 2025-05-07T20:33:25.4249458Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:25.4249909Z def test_silu_mul_quant( 2025-05-07T20:33:25.4250155Z self, 2025-05-07T20:33:25.4250350Z T: int, 2025-05-07T20:33:25.4250548Z D: int, 2025-05-07T20:33:25.4250764Z scale_ub: Optional[float], 2025-05-07T20:33:25.4256994Z contiguous: bool, 2025-05-07T20:33:25.4257281Z compiled: bool, 2025-05-07T20:33:25.4257509Z ) -> None: 2025-05-07T20:33:25.4257731Z torch.manual_seed(2025) 2025-05-07T20:33:25.4257980Z 2025-05-07T20:33:25.4258249Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:25.4258601Z 2025-05-07T20:33:25.4258794Z x_sign = torch.sign(x) 2025-05-07T20:33:25.4259083Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:25.4259407Z x = x_sign * x_clamp 2025-05-07T20:33:25.4259648Z x0 = x[:, :D] 2025-05-07T20:33:25.4259861Z x1 = x[:, D:] 2025-05-07T20:33:25.4260067Z 2025-05-07T20:33:25.4260245Z if contiguous: 2025-05-07T20:33:25.4260468Z x0 = x0.contiguous() 2025-05-07T20:33:25.4260732Z x1 = x1.contiguous() 2025-05-07T20:33:25.4260972Z 2025-05-07T20:33:25.4261163Z if scale_ub is not None: 2025-05-07T20:33:25.4261436Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:25.4261773Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:25.4262084Z ) 2025-05-07T20:33:25.4262267Z else: 2025-05-07T20:33:25.4262476Z scale_ub_tensor = None 2025-05-07T20:33:25.4262728Z 2025-05-07T20:33:25.4262952Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:25.4263322Z op = silu_mul_quant 2025-05-07T20:33:25.4263575Z if compiled: 2025-05-07T20:33:25.4263818Z op = torch.compile(op) 2025-05-07T20:33:25.4264121Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:25.4264513Z 2025-05-07T20:33:25.4264696Z y_fp8, y_scale = fn() 2025-05-07T20:33:25.4264988Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:25.4265285Z 2025-05-07T20:33:25.4265512Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:25.4265919Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:25.4266221Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:25.4266542Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:25.4266904Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:25.4267279Z 2025-05-07T20:33:25.4267475Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:25.4267673Z 2025-05-07T20:33:25.4267773Z moe/activation_test.py:126: 2025-05-07T20:33:25.4268071Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:25.4268411Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:25.4268738Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:25.4269561Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:25.4270361Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:25.4270931Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:25.4271680Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:25.4272401Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:25.4273156Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:25.4273918Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:25.4274580Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:25.4275204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:25.4275745Z fn() 2025-05-07T20:33:25.4276264Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:25.4276880Z self.fn.run( 2025-05-07T20:33:25.4277363Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:25.4277918Z kernel = self.compile( 2025-05-07T20:33:25.4278475Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:25.4279157Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:25.4279564Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:25.4279801Z 2025-05-07T20:33:25.4280014Z self = 2025-05-07T20:33:25.4281134Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:25.4282561Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f099a565e40>} 2025-05-07T20:33:25.4283967Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:25.4285043Z context = 2025-05-07T20:33:25.4285346Z 2025-05-07T20:33:25.4285514Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:25.4286097Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:25.4286582Z module_map=module_map) 2025-05-07T20:33:25.4286956Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:25.4287314Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:25.4287626Z E ^ 2025-05-07T20:33:25.4288104Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:25.4288578Z 2025-05-07T20:33:25.4289013Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:25.4289605Z 2025-05-07T20:33:25.4289714Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:25.4290145Z self=, 2025-05-07T20:33:25.4290567Z T=128, 2025-05-07T20:33:25.4290753Z D=5120, 2025-05-07T20:33:25.4290953Z scale_ub=None, 2025-05-07T20:33:25.4291170Z contiguous=True, 2025-05-07T20:33:25.4291394Z compiled=True, 2025-05-07T20:33:25.4291599Z ) 2025-05-07T20:33:25.8700993Z self = 2025-05-07T20:33:25.8701582Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:25.8701864Z 2025-05-07T20:33:25.8701958Z @given( 2025-05-07T20:33:25.8702198Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:25.8702673Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:25.8703014Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:25.8703406Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:25.8703754Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:25.8704062Z ) 2025-05-07T20:33:25.8704422Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:25.8704874Z def test_silu_mul_quant( 2025-05-07T20:33:25.8705125Z self, 2025-05-07T20:33:25.8705318Z T: int, 2025-05-07T20:33:25.8705510Z D: int, 2025-05-07T20:33:25.8705732Z scale_ub: Optional[float], 2025-05-07T20:33:25.8706007Z contiguous: bool, 2025-05-07T20:33:25.8706241Z compiled: bool, 2025-05-07T20:33:25.8706475Z ) -> None: 2025-05-07T20:33:25.8706692Z torch.manual_seed(2025) 2025-05-07T20:33:25.8706929Z 2025-05-07T20:33:25.8707215Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:25.8707573Z 2025-05-07T20:33:25.8707774Z x_sign = torch.sign(x) 2025-05-07T20:33:25.8708061Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:25.8708383Z x = x_sign * x_clamp 2025-05-07T20:33:25.8708632Z x0 = x[:, :D] 2025-05-07T20:33:25.8708846Z x1 = x[:, D:] 2025-05-07T20:33:25.8709064Z 2025-05-07T20:33:25.8709259Z if contiguous: 2025-05-07T20:33:25.8709492Z x0 = x0.contiguous() 2025-05-07T20:33:25.8709756Z x1 = x1.contiguous() 2025-05-07T20:33:25.8710013Z 2025-05-07T20:33:25.8710197Z if scale_ub is not None: 2025-05-07T20:33:25.8710477Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:25.8710820Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:25.8711134Z ) 2025-05-07T20:33:25.8711331Z else: 2025-05-07T20:33:25.8711546Z scale_ub_tensor = None 2025-05-07T20:33:25.8711800Z 2025-05-07T20:33:25.8712045Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:25.8712370Z op = silu_mul_quant 2025-05-07T20:33:25.8712615Z if compiled: 2025-05-07T20:33:25.8712869Z op = torch.compile(op) 2025-05-07T20:33:25.8713177Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:25.8713506Z 2025-05-07T20:33:25.8713693Z y_fp8, y_scale = fn() 2025-05-07T20:33:25.8713978Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:25.8714359Z 2025-05-07T20:33:25.8714593Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:25.8714942Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:25.8715247Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:25.8715633Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:25.8716003Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:25.8716321Z 2025-05-07T20:33:25.8716520Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:25.8716729Z 2025-05-07T20:33:25.8716898Z moe/activation_test.py:126: 2025-05-07T20:33:25.8717208Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:25.8717559Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:25.8717887Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:25.8718718Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:25.8719520Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:25.8720080Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:25.8720800Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:25.8721571Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:25.8722336Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:25.8723111Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:25.8723802Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:25.8724439Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:25.8724992Z fn() 2025-05-07T20:33:25.8725676Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:25.8726303Z self.fn.run( 2025-05-07T20:33:25.8726801Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:25.8727355Z kernel = self.compile( 2025-05-07T20:33:25.8727930Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:25.8728627Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:25.8729045Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:25.8729292Z 2025-05-07T20:33:25.8729507Z self = 2025-05-07T20:33:25.8730645Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:25.8732096Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e5902ac0>} 2025-05-07T20:33:25.8733512Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:25.8734669Z context = 2025-05-07T20:33:25.8734970Z 2025-05-07T20:33:25.8735140Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:25.8735685Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:25.8736171Z module_map=module_map) 2025-05-07T20:33:25.8736617Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:25.8736988Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:25.8737273Z E ^ 2025-05-07T20:33:25.8737812Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:25.8738727Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
[Hypothesis retries the identical test body for each example below; every retry fails at Triton compile time with the same ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"). Only the example parameters and the first kernel reached differ:]
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True) -> CompilationError in _kernel_quantize_fp8_row (ref_fn -> triton_quantize_fp8_row, fp8_gemm.py:2370)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
W0507 20:33:26.343000 96495 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8)
W0507 20:33:26.343000 96495 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
W0507 20:33:26.343000 96495 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
W0507 20:33:26.343000 96495 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
W0507 20:33:26.343000 96495 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
-> CompilationError in _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True) -> CompilationError in _fbgemm_silu_mul_quant (fn -> silu_mul_quant, activation.py:80)
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=True) -> CompilationError in _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=False) -> CompilationError in _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=True) -> CompilationError in _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> CompilationError in _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False) -> CompilationError in _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> CompilationError in _fbgemm_silu_mul_quant:
2025-05-07T20:33:27.2708842Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:27.2709206Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:33:27.2709472Z E ^
2025-05-07T20:33:27.2709950Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:27.2710419Z 2025-05-07T20:33:27.2710865Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:27.2711407Z 2025-05-07T20:33:27.2711518Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:27.2711941Z self=, 2025-05-07T20:33:27.2712361Z T=1, 2025-05-07T20:33:27.2712550Z D=7168, 2025-05-07T20:33:27.2712740Z scale_ub=1200.0, 2025-05-07T20:33:27.2712973Z contiguous=True, 2025-05-07T20:33:27.2713199Z compiled=True, 2025-05-07T20:33:27.2713426Z ) 2025-05-07T20:33:27.2713772Z self = 2025-05-07T20:33:27.2714332Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:27.2714606Z 2025-05-07T20:33:27.2714682Z @given( 2025-05-07T20:33:27.2714918Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:27.2715279Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:27.2715593Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:27.2715925Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:27.2716267Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:27.2716607Z ) 2025-05-07T20:33:27.2716964Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:27.2717424Z def test_silu_mul_quant( 2025-05-07T20:33:27.2717674Z self, 2025-05-07T20:33:27.2717866Z T: int, 2025-05-07T20:33:27.2718060Z D: int, 2025-05-07T20:33:27.2718274Z scale_ub: Optional[float], 2025-05-07T20:33:27.2718546Z contiguous: bool, 2025-05-07T20:33:27.2718788Z compiled: bool, 2025-05-07T20:33:27.2719007Z ) -> None: 2025-05-07T20:33:27.2719210Z torch.manual_seed(2025) 2025-05-07T20:33:27.2719446Z 2025-05-07T20:33:27.2719722Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:27.2720069Z 2025-05-07T20:33:27.2720262Z x_sign = torch.sign(x) 2025-05-07T20:33:27.2720547Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:27.2720900Z x = x_sign * x_clamp 2025-05-07T20:33:27.2721133Z x0 = x[:, :D] 2025-05-07T20:33:27.2721352Z x1 = x[:, D:] 2025-05-07T20:33:27.2721561Z 2025-05-07T20:33:27.2721737Z if contiguous: 2025-05-07T20:33:27.2721966Z x0 = x0.contiguous() 2025-05-07T20:33:27.2722223Z x1 = x1.contiguous() 2025-05-07T20:33:27.2722461Z 2025-05-07T20:33:27.2722647Z if scale_ub is not None: 2025-05-07T20:33:27.2722920Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:27.2723267Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:27.2723611Z ) 2025-05-07T20:33:27.2723798Z else: 2025-05-07T20:33:27.2724001Z scale_ub_tensor = None 2025-05-07T20:33:27.2724254Z 2025-05-07T20:33:27.2724490Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:27.2724802Z op = silu_mul_quant 2025-05-07T20:33:27.2725054Z if compiled: 2025-05-07T20:33:27.2725300Z op = torch.compile(op) 2025-05-07T20:33:27.2725786Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:27.2726065Z 2025-05-07T20:33:27.2726252Z > y_fp8, y_scale = fn() 2025-05-07T20:33:27.2726415Z 2025-05-07T20:33:27.2726517Z moe/activation_test.py:117: 2025-05-07T20:33:27.2726809Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:27.2727141Z moe/activation_test.py:115: in fn 2025-05-07T20:33:27.2727421Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:27.2727992Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:27.2728571Z return fn(*args, **kwargs) 
2025-05-07T20:33:27.2729259Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:27.2729985Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:27.2730540Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:27.2731254Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:27.2731946Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:27.2732497Z kernel = self.compile( 2025-05-07T20:33:27.2733055Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:27.2733815Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:27.2734226Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:27.2734532Z 2025-05-07T20:33:27.2734804Z self = 2025-05-07T20:33:27.2735934Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:27.2737416Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e50c2ac0>} 2025-05-07T20:33:27.2738825Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:27.2739913Z context = 2025-05-07T20:33:27.2740210Z 2025-05-07T20:33:27.2740379Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:27.2740924Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:27.2741416Z module_map=module_map) 2025-05-07T20:33:27.2741840Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:27.2742213Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:27.2742480Z E ^ 2025-05-07T20:33:27.2742966Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:27.2743875Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:27.2744526Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True): identical CompilationError in _fbgemm_silu_mul_quant (duplicate source listing and traceback omitted)
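All of these failures share one root cause, visible in the ValueError itself: Triton only lowers the fp8e4nv dtype (the E4M3 FP8 variant behind torch.float8_e4m3fn) on GPUs of compute capability 8.9 or newer, while the A10G in a linux.g5.4xlarge runner reports capability (8, 6), leaving only fp8e4b15 and fp8e5 available. A minimal sketch of a capability gate, assuming only stock PyTorch (the helper name is illustrative, not FBGEMM API):

import torch

def device_supports_fp8e4nv() -> bool:
    # fp8e4nv compiles only on Ada/Hopper-class GPUs (compute capability >= 8.9);
    # the A10G here reports (8, 6), which is why every compile attempt fails.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)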
2025-05-07T20:33:27.4133850Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True). This example gets past fn(); the failure moves to the reference path (duplicate test source omitted):
2025-05-07T20:33:27.5019049Z y_fp8, y_scale = fn()
2025-05-07T20:33:27.5019341Z y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:33:27.5019875Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:27.5020222Z x0_fp32 = x0.to(torch.float32)
2025-05-07T20:33:27.5020599Z x1_fp32 = x1.to(torch.float32)
2025-05-07T20:33:27.5020920Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:33:27.5021295Z return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:27.5021955Z > y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:33:27.5022260Z moe/activation_test.py:126:
2025-05-07T20:33:27.5022568Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:27.5022965Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:33:27.5023299Z return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:27.5024126Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:33:27.5024929Z _kernel_quantize_fp8_row[grid](
2025-05-07T20:33:27.5027129Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:33:27.5027892Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:33:27.5028725Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:33:27.5029409Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:33:27.5030044Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:33:27.5030586Z fn()
2025-05-07T20:33:27.5031117Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:33:27.5031748Z self.fn.run(
2025-05-07T20:33:27.5032240Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:27.5032798Z kernel = self.compile(
2025-05-07T20:33:27.5033372Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:27.5034069Z module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:27.5041998Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:27.5042372Z E def _kernel_quantize_fp8_row(
2025-05-07T20:33:27.5042717Z E ^
2025-05-07T20:33:27.5043187Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:27.5044162Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
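The contract of triton_quantize_fp8_row can be read off the test itself, since it dequantizes with y = y_fp8.to(torch.float32) * y_scale[:, None]: the kernel returns an FP8 tensor plus one dequantization scale per row, with scale_ub optionally capping the per-row maximum. A rough PyTorch restatement, assuming torch.float8_e4m3fn with 448 as its largest finite value; this is an illustrative sketch, not the Triton kernel:

import torch

FP8_E4M3_MAX = 448.0  # largest finite value of torch.float8_e4m3fn

def quantize_fp8_row_sketch(y: torch.Tensor, scale_ub: torch.Tensor | None = None):
    # Per-row dequantization scale, optionally capped by scale_ub (as in the test).
    row_max = y.abs().amax(dim=1).clamp(min=1e-12)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    y_scale = row_max / FP8_E4M3_MAX
    # Scale each row into FP8 range and cast; dequant is y_fp8.float() * y_scale[:, None].
    y_fp8 = (y / y_scale[:, None]).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX).to(torch.float8_e4m3fn)
    return y_fp8, y_scale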
2025-05-07T20:33:27.5044821Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True): identical CompilationError in _fbgemm_silu_mul_quant (duplicate source listing and traceback omitted)
2025-05-07T20:33:27.6637525Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False): identical CompilationError in _fbgemm_silu_mul_quant (duplicate source listing and traceback omitted)
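Because the failure is a property of the runner's GPU rather than of any particular input, Hypothesis keeps rediscovering it on every drawn example. One way to fail fast, sketched here with stock unittest rather than the repo's own decorators, is to skip on unsupported hardware:

import unittest
import torch

def _has_fp8_gpu() -> bool:
    # True only on GPUs where Triton can lower fp8e4nv (Ada/Hopper and newer).
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

@unittest.skipIf(not _has_fp8_gpu(), "fp8e4nv needs compute capability >= 8.9")
class SiluMulQuantSkipDemo(unittest.TestCase):
    def test_placeholder(self) -> None:
        # Placeholder body; the real suite would run test_silu_mul_quant here.
        self.assertTrue(_has_fp8_gpu())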
2025-05-07T20:33:27.6668977Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True): identical CompilationError in _fbgemm_silu_mul_quant (duplicate source listing and traceback omitted)
2025-05-07T20:33:27.7575758Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True): identical CompilationError in _fbgemm_silu_mul_quant (duplicate source listing and traceback omitted)
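The operation under test is small enough to restate exactly from ref_fn above: SiLU of the first half of the input, multiplied elementwise by the second half, computed in fp32 before quantization. A self-contained restatement for reference:

import torch

def silu_mul_reference(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
    # SiLU(x0) * x1 in fp32, matching ref_fn: x0 * sigmoid(x0) * x1.
    x0_fp32 = x0.to(torch.float32)
    x1_fp32 = x1.to(torch.float32)
    return x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32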
2025-05-07T20:33:27.8768132Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=False): identical CompilationError in _fbgemm_silu_mul_quant (duplicate source listing and traceback omitted)
2025-05-07T20:33:27.8835070Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False): identical CompilationError in _fbgemm_silu_mul_quant (duplicate source listing and traceback omitted)
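To replay one of these parameter sets locally, Hypothesis's @example decorator pins a case so it is tried before any random draws. A standalone sketch with illustrative values (the real test would decorate test_silu_mul_quant and keep its full @given signature):

from hypothesis import example, given, settings, strategies as st

@given(T=st.sampled_from([1, 128, 2048, 4096, 16384]))
@example(T=4096)  # one of the failing sizes from this log, always tried first
@settings(max_examples=10, deadline=None)
def check_shapes(T: int) -> None:
    # Placeholder body; the real test would build inputs of shape [T, 2 * D]
    # and call silu_mul_quant as above.
    assert T >= 1

check_shapes()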
2025-05-07T20:33:28.0617648Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True): identical CompilationError in _fbgemm_silu_mul_quant (duplicate source listing and traceback omitted)
2025-05-07T20:33:28.0617648Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:28.0618069Z     self=<...>,
2025-05-07T20:33:28.0618488Z     T=4096,
2025-05-07T20:33:28.0618678Z     D=5120,
2025-05-07T20:33:28.0618864Z     scale_ub=None,
2025-05-07T20:33:28.0619076Z     contiguous=False,
2025-05-07T20:33:28.0619294Z     compiled=True,
2025-05-07T20:33:28.0619485Z )
2025-05-07T20:33:28.0619807Z self = <...>
2025-05-07T20:33:28.0620332Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:33:28.0620615Z 
2025-05-07T20:33:28.0620690Z     @given(
2025-05-07T20:33:28.0620911Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:33:28.0621221Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:33:28.0621528Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:33:28.0621859Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:33:28.0622181Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:33:28.0622470Z     )
2025-05-07T20:33:28.0622818Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:33:28.0623265Z     def test_silu_mul_quant(
2025-05-07T20:33:28.0623504Z         self,
2025-05-07T20:33:28.0623693Z         T: int,
2025-05-07T20:33:28.0623881Z         D: int,
2025-05-07T20:33:28.0624093Z         scale_ub: Optional[float],
2025-05-07T20:33:28.0624365Z         contiguous: bool,
2025-05-07T20:33:28.0624597Z         compiled: bool,
2025-05-07T20:33:28.0624807Z     ) -> None:
2025-05-07T20:33:28.0625013Z         torch.manual_seed(2025)
2025-05-07T20:33:28.0625247Z 
2025-05-07T20:33:28.0625694Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:28.0626043Z 
2025-05-07T20:33:28.0626230Z         x_sign = torch.sign(x)
2025-05-07T20:33:28.0626514Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:33:28.0626821Z         x = x_sign * x_clamp
2025-05-07T20:33:28.0627058Z         x0 = x[:, :D]
2025-05-07T20:33:28.0627259Z         x1 = x[:, D:]
2025-05-07T20:33:28.0627462Z 
2025-05-07T20:33:28.0627642Z         if contiguous:
2025-05-07T20:33:28.0627866Z             x0 = x0.contiguous()
2025-05-07T20:33:28.0628118Z             x1 = x1.contiguous()
2025-05-07T20:33:28.0628357Z 
2025-05-07T20:33:28.0628538Z         if scale_ub is not None:
2025-05-07T20:33:28.0628877Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:33:28.0629210Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:33:28.0629521Z             )
2025-05-07T20:33:28.0629704Z         else:
2025-05-07T20:33:28.0629910Z             scale_ub_tensor = None
2025-05-07T20:33:28.0630158Z 
2025-05-07T20:33:28.0630437Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:28.0630757Z             op = silu_mul_quant
2025-05-07T20:33:28.0631011Z             if compiled:
2025-05-07T20:33:28.0631255Z                 op = torch.compile(op)
2025-05-07T20:33:28.0631642Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:28.0631927Z 
2025-05-07T20:33:28.0632113Z >       y_fp8, y_scale = fn()
2025-05-07T20:33:28.0632281Z 
2025-05-07T20:33:28.0632381Z moe/activation_test.py:117: 
2025-05-07T20:33:28.0632679Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:28.0633014Z moe/activation_test.py:115: in fn
2025-05-07T20:33:28.0639979Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:28.0640612Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:33:28.0641204Z     return fn(*args, **kwargs)
2025-05-07T20:33:28.0641904Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:33:28.0642635Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:33:28.0643296Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:28.0644025Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:28.0644725Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:28.0645292Z     kernel = self.compile(
2025-05-07T20:33:28.0645863Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:28.0646568Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:28.0646982Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:28.0647224Z 
2025-05-07T20:33:28.0647449Z self = <...>
2025-05-07T20:33:28.0648594Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:28.0650048Z codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <function ... at 0x7f08e46a91c0>}
2025-05-07T20:33:28.0651469Z module_map = {'triton.language.extra.libdevice': <...>}
2025-05-07T20:33:28.0652569Z context = <...>
2025-05-07T20:33:28.0652872Z 
2025-05-07T20:33:28.0653046Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:28.0653600Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:28.0654093Z                            module_map=module_map)
2025-05-07T20:33:28.0654527Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:28.0654896Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:33:28.0655176Z E       ^
2025-05-07T20:33:28.0655662Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:28.0656140Z 
2025-05-07T20:33:28.0656583Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
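For context on what the failing test exercises: silu_mul_quant fuses SiLU(x0) * x1 with quantization to FP8. An eager-mode reference for that contract, inferred from the test body above, might look like the following sketch; the amax-based scaling and the clamping details are assumptions, not the FBGEMM kernel's exact scheme:

    from typing import Optional, Tuple

    import torch
    import torch.nn.functional as F

    # Largest finite value representable in float8_e4m3fn (448.0).
    FP8_E4M3_MAX = torch.finfo(torch.float8_e4m3fn).max


    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Compute SiLU(x0) * x1 in float32, then quantize symmetrically to fp8
        # with a tensor-wide scale, optionally capped by scale_ub.
        y = F.silu(x0.float()) * x1.float()
        amax = y.abs().amax().clamp(min=1e-12)
        if scale_ub is not None:
            amax = torch.minimum(amax, scale_ub.float())
        scale = amax / FP8_E4M3_MAX
        y_fp8 = (y / scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX).to(torch.float8_e4m3fn)
        return y_fp8, scale

Such a baseline lets y_fp8.float() * scale be compared against the bf16 product under a loose tolerance, which is presumably what the assertions after the fn() call do.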
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.0656140Z 2025-05-07T20:33:28.0656583Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.0657190Z 2025-05-07T20:33:28.4011926Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.4012601Z self=, 2025-05-07T20:33:28.4013208Z T=4096, 2025-05-07T20:33:28.4013474Z D=5120, 2025-05-07T20:33:28.4013935Z scale_ub=1200.0, 2025-05-07T20:33:28.4014258Z contiguous=False, 2025-05-07T20:33:28.4014573Z compiled=False, 2025-05-07T20:33:28.4014782Z ) 2025-05-07T20:33:28.4015112Z self = 2025-05-07T20:33:28.4015715Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:28.4016006Z 2025-05-07T20:33:28.4016087Z @given( 2025-05-07T20:33:28.4016318Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.4016642Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.4016948Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.4017291Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.4017631Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.4017923Z ) 2025-05-07T20:33:28.4018276Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.4018737Z def test_silu_mul_quant( 2025-05-07T20:33:28.4018987Z self, 2025-05-07T20:33:28.4019178Z T: int, 2025-05-07T20:33:28.4019373Z D: int, 2025-05-07T20:33:28.4019661Z scale_ub: Optional[float], 2025-05-07T20:33:28.4019941Z contiguous: bool, 2025-05-07T20:33:28.4020191Z compiled: bool, 2025-05-07T20:33:28.4020424Z ) -> None: 2025-05-07T20:33:28.4020633Z torch.manual_seed(2025) 2025-05-07T20:33:28.4020875Z 2025-05-07T20:33:28.4021151Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.4021501Z 2025-05-07T20:33:28.4021696Z x_sign = torch.sign(x) 2025-05-07T20:33:28.4021998Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.4022308Z x = x_sign * x_clamp 2025-05-07T20:33:28.4022549Z x0 = x[:, :D] 2025-05-07T20:33:28.4022762Z x1 = x[:, D:] 2025-05-07T20:33:28.4022975Z 2025-05-07T20:33:28.4023161Z if contiguous: 2025-05-07T20:33:28.4023401Z x0 = x0.contiguous() 2025-05-07T20:33:28.4023673Z x1 = x1.contiguous() 2025-05-07T20:33:28.4023937Z 2025-05-07T20:33:28.4024160Z if scale_ub is not None: 2025-05-07T20:33:28.4024439Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.4024780Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.4025098Z ) 2025-05-07T20:33:28.4025295Z else: 2025-05-07T20:33:28.4025680Z scale_ub_tensor = None 2025-05-07T20:33:28.4025942Z 2025-05-07T20:33:28.4026181Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.4026497Z op = silu_mul_quant 2025-05-07T20:33:28.4026757Z if compiled: 2025-05-07T20:33:28.4027035Z op = torch.compile(op) 2025-05-07T20:33:28.4027368Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.4027733Z 2025-05-07T20:33:28.4027934Z > y_fp8, y_scale = fn() 2025-05-07T20:33:28.4028114Z 2025-05-07T20:33:28.4028220Z moe/activation_test.py:117: 2025-05-07T20:33:28.4028613Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.4028965Z moe/activation_test.py:115: in fn 2025-05-07T20:33:28.4029302Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.4030108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:28.4030897Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:28.4031516Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.4032434Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.4033208Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.4033894Z kernel = self.compile( 2025-05-07T20:33:28.4034632Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.4035423Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.4035833Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.4036236Z 2025-05-07T20:33:28.4036455Z self = 2025-05-07T20:33:28.4037680Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.4039266Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e46aa160>} 2025-05-07T20:33:28.4040847Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.4042166Z context = 2025-05-07T20:33:28.4042494Z 2025-05-07T20:33:28.4042669Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.4043300Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.4043865Z module_map=module_map) 2025-05-07T20:33:28.4044238Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.4044680Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.4044944Z E ^ 2025-05-07T20:33:28.4045493Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.4045979Z 2025-05-07T20:33:28.4046495Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.4047103Z 2025-05-07T20:33:28.4047230Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.4047667Z self=, 2025-05-07T20:33:28.4048153Z T=4096, 2025-05-07T20:33:28.4048355Z D=5120, 2025-05-07T20:33:28.4048559Z scale_ub=1200.0, 2025-05-07T20:33:28.4048840Z contiguous=False, 2025-05-07T20:33:28.4049084Z compiled=True, 2025-05-07T20:33:28.4049300Z ) 2025-05-07T20:33:28.4049675Z self = 2025-05-07T20:33:28.4050242Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:28.4050573Z 2025-05-07T20:33:28.4050653Z @given( 2025-05-07T20:33:28.4050897Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.4051282Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.4051611Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.4052024Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.4052358Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.4052671Z ) 2025-05-07T20:33:28.4053087Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.4053572Z def test_silu_mul_quant( 2025-05-07T20:33:28.4053870Z self, 2025-05-07T20:33:28.4054081Z T: int, 2025-05-07T20:33:28.4054284Z D: int, 2025-05-07T20:33:28.4054709Z scale_ub: Optional[float], 2025-05-07T20:33:28.4054992Z contiguous: bool, 2025-05-07T20:33:28.4055233Z compiled: bool, 2025-05-07T20:33:28.4055655Z ) -> None: 2025-05-07T20:33:28.4055882Z torch.manual_seed(2025) 2025-05-07T20:33:28.4056145Z 2025-05-07T20:33:28.4056426Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.4056786Z 2025-05-07T20:33:28.4056988Z x_sign = torch.sign(x) 2025-05-07T20:33:28.4057328Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.4057657Z x = x_sign * x_clamp 2025-05-07T20:33:28.4057909Z x0 = x[:, :D] 2025-05-07T20:33:28.4058125Z x1 = x[:, D:] 2025-05-07T20:33:28.4058390Z 2025-05-07T20:33:28.4058577Z if contiguous: 2025-05-07T20:33:28.4058810Z x0 = x0.contiguous() 2025-05-07T20:33:28.4059078Z x1 = x1.contiguous() 2025-05-07T20:33:28.4059324Z 2025-05-07T20:33:28.4059510Z if scale_ub is not None: 2025-05-07T20:33:28.4059791Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.4060144Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.4060463Z ) 2025-05-07T20:33:28.4060664Z else: 2025-05-07T20:33:28.4060877Z scale_ub_tensor = None 2025-05-07T20:33:28.4061138Z 2025-05-07T20:33:28.4061374Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.4061702Z op = silu_mul_quant 2025-05-07T20:33:28.4061957Z if compiled: 2025-05-07T20:33:28.4062202Z op = torch.compile(op) 2025-05-07T20:33:28.4062557Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.4062853Z 2025-05-07T20:33:28.4063047Z > y_fp8, y_scale = fn() 2025-05-07T20:33:28.4063223Z 2025-05-07T20:33:28.4063324Z moe/activation_test.py:117: 2025-05-07T20:33:28.4063634Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.4063972Z moe/activation_test.py:115: in fn 2025-05-07T20:33:28.4064266Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.4064856Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:28.4065451Z return fn(*args, **kwargs) 
2025-05-07T20:33:28.4066137Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:28.4066872Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:28.4067437Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.4068166Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.4068869Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.4069432Z kernel = self.compile( 2025-05-07T20:33:28.4069992Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.4070681Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.4071097Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.4071341Z 2025-05-07T20:33:28.4071551Z self = 2025-05-07T20:33:28.4072687Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.4074172Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e46ab240>} 2025-05-07T20:33:28.4075581Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.4076722Z context = 2025-05-07T20:33:28.4077024Z 2025-05-07T20:33:28.4077198Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.4077815Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.4078313Z module_map=module_map) 2025-05-07T20:33:28.4078694Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.4079057Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.4079373Z E ^ 2025-05-07T20:33:28.4079862Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.4080339Z 2025-05-07T20:33:28.4080778Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.4081331Z 2025-05-07T20:33:28.5233687Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.5234413Z self=, 2025-05-07T20:33:28.5235032Z T=2048, 2025-05-07T20:33:28.5235297Z D=7168, 2025-05-07T20:33:28.5235550Z scale_ub=1200.0, 2025-05-07T20:33:28.5235787Z contiguous=False, 2025-05-07T20:33:28.5236024Z compiled=False, 2025-05-07T20:33:28.5236242Z ) 2025-05-07T20:33:28.5236578Z self = 2025-05-07T20:33:28.5237260Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:28.5237561Z 2025-05-07T20:33:28.5237640Z @given( 2025-05-07T20:33:28.5237874Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.5238184Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.5238496Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.5238834Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.5239172Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.5239457Z ) 2025-05-07T20:33:28.5239818Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.5240278Z def test_silu_mul_quant( 2025-05-07T20:33:28.5240521Z self, 2025-05-07T20:33:28.5240719Z T: int, 2025-05-07T20:33:28.5240925Z D: int, 2025-05-07T20:33:28.5241141Z scale_ub: Optional[float], 2025-05-07T20:33:28.5241421Z contiguous: bool, 2025-05-07T20:33:28.5241667Z compiled: bool, 2025-05-07T20:33:28.5241895Z ) -> None: 2025-05-07T20:33:28.5242126Z torch.manual_seed(2025) 2025-05-07T20:33:28.5242372Z 2025-05-07T20:33:28.5242646Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.5243006Z 2025-05-07T20:33:28.5243209Z x_sign = torch.sign(x) 2025-05-07T20:33:28.5243504Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.5243825Z x = x_sign * x_clamp 2025-05-07T20:33:28.5244074Z x0 = x[:, :D] 2025-05-07T20:33:28.5244296Z x1 = x[:, D:] 2025-05-07T20:33:28.5244502Z 2025-05-07T20:33:28.5244694Z if contiguous: 2025-05-07T20:33:28.5244933Z x0 = x0.contiguous() 2025-05-07T20:33:28.5245382Z x1 = x1.contiguous() 2025-05-07T20:33:28.5245639Z 2025-05-07T20:33:28.5245839Z if scale_ub is not None: 2025-05-07T20:33:28.5246114Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.5246457Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.5246782Z ) 2025-05-07T20:33:28.5246972Z else: 2025-05-07T20:33:28.5247191Z scale_ub_tensor = None 2025-05-07T20:33:28.5247451Z 2025-05-07T20:33:28.5247680Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.5248009Z op = silu_mul_quant 2025-05-07T20:33:28.5248266Z if compiled: 2025-05-07T20:33:28.5248511Z op = torch.compile(op) 2025-05-07T20:33:28.5248904Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.5249192Z 2025-05-07T20:33:28.5249393Z > y_fp8, y_scale = fn() 2025-05-07T20:33:28.5249566Z 2025-05-07T20:33:28.5249672Z moe/activation_test.py:117: 2025-05-07T20:33:28.5250050Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.5250399Z moe/activation_test.py:115: in fn 2025-05-07T20:33:28.5250688Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.5251422Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:28.5252230Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:28.5252790Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.5253517Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.5254223Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.5254904Z kernel = self.compile( 2025-05-07T20:33:28.5255469Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.5256179Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.5256596Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.5256884Z 2025-05-07T20:33:28.5257110Z self = 2025-05-07T20:33:28.5258234Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.5259685Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e51a4220>} 2025-05-07T20:33:28.5261116Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.5262210Z context = 2025-05-07T20:33:28.5262514Z 2025-05-07T20:33:28.5262699Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.5263247Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.5263748Z module_map=module_map) 2025-05-07T20:33:28.5264159Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.5264537Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.5264810Z E ^ 2025-05-07T20:33:28.5265300Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.5265779Z 2025-05-07T20:33:28.5266227Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.5266778Z 2025-05-07T20:33:28.5266895Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.5267326Z self=, 2025-05-07T20:33:28.5267756Z T=1, 2025-05-07T20:33:28.5267958Z D=7168, 2025-05-07T20:33:28.5268156Z scale_ub=None, 2025-05-07T20:33:28.5268390Z contiguous=True, 2025-05-07T20:33:28.5268626Z compiled=False, 2025-05-07T20:33:28.5268835Z ) 2025-05-07T20:33:28.5269173Z self = 2025-05-07T20:33:28.5269687Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:28.5269961Z 2025-05-07T20:33:28.5270047Z @given( 2025-05-07T20:33:28.5270338Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.5270674Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.5271000Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.5271337Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.5271726Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.5272026Z ) 2025-05-07T20:33:28.5272388Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.5272856Z def test_silu_mul_quant( 2025-05-07T20:33:28.5273154Z self, 2025-05-07T20:33:28.5273355Z T: int, 2025-05-07T20:33:28.5273563Z D: int, 2025-05-07T20:33:28.5273796Z scale_ub: Optional[float], 2025-05-07T20:33:28.5274082Z contiguous: bool, 2025-05-07T20:33:28.5274337Z compiled: bool, 2025-05-07T20:33:28.5274572Z ) -> None: 2025-05-07T20:33:28.5274790Z torch.manual_seed(2025) 2025-05-07T20:33:28.5275044Z 2025-05-07T20:33:28.5275331Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.5275690Z 2025-05-07T20:33:28.5275881Z x_sign = torch.sign(x) 2025-05-07T20:33:28.5276180Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.5276509Z x = x_sign * x_clamp 2025-05-07T20:33:28.5276747Z x0 = x[:, :D] 2025-05-07T20:33:28.5276975Z x1 = x[:, D:] 2025-05-07T20:33:28.5277192Z 2025-05-07T20:33:28.5277425Z if contiguous: 2025-05-07T20:33:28.5277666Z x0 = x0.contiguous() 2025-05-07T20:33:28.5277939Z x1 = x1.contiguous() 2025-05-07T20:33:28.5278179Z 2025-05-07T20:33:28.5278379Z if scale_ub is not None: 2025-05-07T20:33:28.5278661Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.5278997Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.5279312Z ) 2025-05-07T20:33:28.5279510Z else: 2025-05-07T20:33:28.5279725Z scale_ub_tensor = None 2025-05-07T20:33:28.5279990Z 2025-05-07T20:33:28.5280233Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.5280557Z op = silu_mul_quant 2025-05-07T20:33:28.5280804Z if compiled: 2025-05-07T20:33:28.5281061Z op = torch.compile(op) 2025-05-07T20:33:28.5281367Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.5281648Z 2025-05-07T20:33:28.5281846Z > y_fp8, y_scale = fn() 2025-05-07T20:33:28.5282016Z 2025-05-07T20:33:28.5282120Z moe/activation_test.py:117: 2025-05-07T20:33:28.5282416Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.5282758Z moe/activation_test.py:115: in fn 2025-05-07T20:33:28.5283051Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.5283768Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:28.5284500Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:28.5285065Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.5285788Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.5286484Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.5287040Z kernel = self.compile( 2025-05-07T20:33:28.5287606Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.5288299Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.5288704Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.5288947Z 2025-05-07T20:33:28.5289157Z self = 2025-05-07T20:33:28.5290281Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.5291799Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e51a5120>} 2025-05-07T20:33:28.5293216Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.5294378Z context = 2025-05-07T20:33:28.5294764Z 2025-05-07T20:33:28.5294937Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.5295484Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.5295969Z module_map=module_map) 2025-05-07T20:33:28.5296345Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.5296715Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.5296978Z E ^ 2025-05-07T20:33:28.5297467Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.5297944Z 2025-05-07T20:33:28.5298425Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.5298973Z 2025-05-07T20:33:28.5299089Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.5299516Z self=, 2025-05-07T20:33:28.5299938Z T=16384, 2025-05-07T20:33:28.5300139Z D=7168, 2025-05-07T20:33:28.5300328Z scale_ub=1200.0, 2025-05-07T20:33:28.5300556Z contiguous=False, 2025-05-07T20:33:28.5300784Z compiled=True, 2025-05-07T20:33:28.7709819Z ) 2025-05-07T20:33:28.7710641Z self = 2025-05-07T20:33:28.7711453Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:28.7711869Z 2025-05-07T20:33:28.7711986Z @given( 2025-05-07T20:33:28.7712277Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.7712604Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.7712935Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.7713290Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.7713632Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.7713938Z ) 2025-05-07T20:33:28.7714304Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.7714766Z def test_silu_mul_quant( 2025-05-07T20:33:28.7715023Z self, 2025-05-07T20:33:28.7715237Z T: int, 2025-05-07T20:33:28.7715440Z D: int, 2025-05-07T20:33:28.7715675Z scale_ub: Optional[float], 2025-05-07T20:33:28.7715963Z contiguous: bool, 2025-05-07T20:33:28.7716215Z compiled: bool, 2025-05-07T20:33:28.7716444Z ) -> None: 2025-05-07T20:33:28.7716675Z torch.manual_seed(2025) 2025-05-07T20:33:28.7716933Z 2025-05-07T20:33:28.7717217Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.7717582Z 2025-05-07T20:33:28.7717796Z x_sign = torch.sign(x) 2025-05-07T20:33:28.7718094Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.7718428Z x = x_sign * x_clamp 2025-05-07T20:33:28.7718680Z x0 = x[:, :D] 2025-05-07T20:33:28.7718904Z x1 = x[:, D:] 2025-05-07T20:33:28.7719128Z 2025-05-07T20:33:28.7719333Z if contiguous: 2025-05-07T20:33:28.7719579Z x0 = x0.contiguous() 2025-05-07T20:33:28.7719855Z x1 = x1.contiguous() 2025-05-07T20:33:28.7720237Z 2025-05-07T20:33:28.7720431Z if scale_ub is not None: 2025-05-07T20:33:28.7720718Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.7721067Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.7721383Z ) 2025-05-07T20:33:28.7721587Z else: 2025-05-07T20:33:28.7721873Z scale_ub_tensor = None 2025-05-07T20:33:28.7722131Z 2025-05-07T20:33:28.7722369Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.7722707Z op = silu_mul_quant 2025-05-07T20:33:28.7723034Z if compiled: 2025-05-07T20:33:28.7723284Z op = torch.compile(op) 2025-05-07T20:33:28.7723606Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.7723900Z 2025-05-07T20:33:28.7724099Z > y_fp8, y_scale = fn() 2025-05-07T20:33:28.7724282Z 2025-05-07T20:33:28.7724391Z moe/activation_test.py:117: 2025-05-07T20:33:28.7724702Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.7725048Z moe/activation_test.py:115: in fn 2025-05-07T20:33:28.7725347Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.7726239Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:28.7726844Z return fn(*args, **kwargs) 
2025-05-07T20:33:28.7727618Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:28.7728357Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:28.7728932Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.7729654Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.7730365Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.7730943Z kernel = self.compile( 2025-05-07T20:33:28.7731521Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.7732215Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.7732647Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.7732888Z 2025-05-07T20:33:28.7733119Z self = 2025-05-07T20:33:28.7734262Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.7735815Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e51a6520>} 2025-05-07T20:33:28.7737236Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.7738334Z context = 2025-05-07T20:33:28.7738637Z 2025-05-07T20:33:28.7738828Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.7739377Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.7739879Z module_map=module_map) 2025-05-07T20:33:28.7740273Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.7740654Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.7740932Z E ^ 2025-05-07T20:33:28.7741428Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.7741906Z 2025-05-07T20:33:28.7742425Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.7742980Z 2025-05-07T20:33:28.7743102Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.7743602Z self=, 2025-05-07T20:33:28.7744038Z T=1, 2025-05-07T20:33:28.7744242Z D=7168, 2025-05-07T20:33:28.7744446Z scale_ub=None, 2025-05-07T20:33:28.7744680Z contiguous=False, 2025-05-07T20:33:28.7744931Z compiled=False, 2025-05-07T20:33:28.7745207Z ) 2025-05-07T20:33:28.7745550Z self = 2025-05-07T20:33:28.7746070Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:28.7746348Z 2025-05-07T20:33:28.7746434Z @given( 2025-05-07T20:33:28.7746679Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.7747021Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.7747358Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.7747709Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.7748067Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.7748385Z ) 2025-05-07T20:33:28.7748756Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.7749236Z def test_silu_mul_quant( 2025-05-07T20:33:28.7749500Z self, 2025-05-07T20:33:28.7749761Z T: int, 2025-05-07T20:33:28.7749987Z D: int, 2025-05-07T20:33:28.7750230Z scale_ub: Optional[float], 2025-05-07T20:33:28.7750507Z contiguous: bool, 2025-05-07T20:33:28.7750762Z compiled: bool, 2025-05-07T20:33:28.7751002Z ) -> None: 2025-05-07T20:33:28.7751223Z torch.manual_seed(2025) 2025-05-07T20:33:28.7751476Z 2025-05-07T20:33:28.7751773Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.7752152Z 2025-05-07T20:33:28.7752354Z x_sign = torch.sign(x) 2025-05-07T20:33:28.7752668Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.7752999Z x = x_sign * x_clamp 2025-05-07T20:33:28.7753253Z x0 = x[:, :D] 2025-05-07T20:33:28.7753490Z x1 = x[:, D:] 2025-05-07T20:33:28.7753712Z 2025-05-07T20:33:28.7753896Z if contiguous: 2025-05-07T20:33:28.7754141Z x0 = x0.contiguous() 2025-05-07T20:33:28.7754414Z x1 = x1.contiguous() 2025-05-07T20:33:28.7754663Z 2025-05-07T20:33:28.7754866Z if scale_ub is not None: 2025-05-07T20:33:28.7755152Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.7755491Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.7755817Z ) 2025-05-07T20:33:28.7756019Z else: 2025-05-07T20:33:28.7756229Z scale_ub_tensor = None 2025-05-07T20:33:28.7756492Z 2025-05-07T20:33:28.7756733Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.7757060Z op = silu_mul_quant 2025-05-07T20:33:28.7757307Z if compiled: 2025-05-07T20:33:28.7757559Z op = torch.compile(op) 2025-05-07T20:33:28.7757864Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.7758137Z 2025-05-07T20:33:28.7758334Z > y_fp8, y_scale = fn() 2025-05-07T20:33:28.7758498Z 2025-05-07T20:33:28.7758606Z moe/activation_test.py:117: 2025-05-07T20:33:28.7758901Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.7759249Z moe/activation_test.py:115: in fn 2025-05-07T20:33:28.7759536Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.7760259Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:28.7760986Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:28.7761550Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.7762327Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.7763024Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.7763628Z kernel = self.compile( 2025-05-07T20:33:28.7764205Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.7764899Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.7765349Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.7765595Z 2025-05-07T20:33:28.7765811Z self = 2025-05-07T20:33:28.7766937Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.7768367Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e51a7100>} 2025-05-07T20:33:28.7769830Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.7770919Z context = 2025-05-07T20:33:28.7771222Z 2025-05-07T20:33:28.7771392Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.7771938Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.7772420Z module_map=module_map) 2025-05-07T20:33:28.7772793Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.7773157Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.7773419Z E ^ 2025-05-07T20:33:28.7773907Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.7774383Z 2025-05-07T20:33:28.7774911Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.7775458Z 2025-05-07T20:33:28.7775571Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.7776003Z self=, 2025-05-07T20:33:28.7776425Z T=2048, 2025-05-07T20:33:28.7776619Z D=7168, 2025-05-07T20:33:28.7776814Z scale_ub=None, 2025-05-07T20:33:28.7777038Z contiguous=False, 2025-05-07T20:33:28.7777269Z compiled=True, 2025-05-07T20:33:28.7777481Z ) 2025-05-07T20:33:28.8651628Z self = 2025-05-07T20:33:28.8652531Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:28.8652955Z 2025-05-07T20:33:28.8653068Z @given( 2025-05-07T20:33:28.8653316Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.8653646Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.8654001Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.8654340Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.8654741Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.8655031Z ) 2025-05-07T20:33:28.8655392Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.8655853Z def test_silu_mul_quant( 2025-05-07T20:33:28.8656094Z self, 2025-05-07T20:33:28.8656299Z T: int, 2025-05-07T20:33:28.8656504Z D: int, 2025-05-07T20:33:28.8656730Z scale_ub: Optional[float], 2025-05-07T20:33:28.8657119Z contiguous: bool, 2025-05-07T20:33:28.8657368Z compiled: bool, 2025-05-07T20:33:28.8657596Z ) -> None: 2025-05-07T20:33:28.8657807Z torch.manual_seed(2025) 2025-05-07T20:33:28.8658043Z 2025-05-07T20:33:28.8658323Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.8658771Z 2025-05-07T20:33:28.8658964Z x_sign = torch.sign(x) 2025-05-07T20:33:28.8659257Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.8659577Z x = x_sign * x_clamp 2025-05-07T20:33:28.8659813Z x0 = x[:, :D] 2025-05-07T20:33:28.8660093Z x1 = x[:, D:] 2025-05-07T20:33:28.8660299Z 2025-05-07T20:33:28.8660493Z if contiguous: 2025-05-07T20:33:28.8660728Z x0 = x0.contiguous() 2025-05-07T20:33:28.8660992Z x1 = x1.contiguous() 2025-05-07T20:33:28.8661247Z 2025-05-07T20:33:28.8661434Z if scale_ub is not None: 2025-05-07T20:33:28.8661711Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.8662049Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.8662358Z ) 2025-05-07T20:33:28.8662545Z else: 2025-05-07T20:33:28.8662755Z scale_ub_tensor = None 2025-05-07T20:33:28.8663011Z 2025-05-07T20:33:28.8663246Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.8663564Z op = silu_mul_quant 2025-05-07T20:33:28.8663812Z if compiled: 2025-05-07T20:33:28.8664125Z op = torch.compile(op) 2025-05-07T20:33:28.8664429Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.8664704Z 2025-05-07T20:33:28.8664898Z > y_fp8, y_scale = fn() 2025-05-07T20:33:28.8665070Z 2025-05-07T20:33:28.8665168Z moe/activation_test.py:117: 2025-05-07T20:33:28.8665466Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.8665798Z moe/activation_test.py:115: in fn 2025-05-07T20:33:28.8666085Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.8666670Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:28.8667255Z return fn(*args, **kwargs) 
2025-05-07T20:33:28.8667946Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:28.8668670Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:28.8669236Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.8669942Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.8670639Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.8671196Z kernel = self.compile( 2025-05-07T20:33:28.8671750Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.8672438Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.8672840Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.8673075Z 2025-05-07T20:33:28.8673294Z self = 2025-05-07T20:33:28.8674404Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.8675833Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e4704720>} 2025-05-07T20:33:28.8677236Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.8678370Z context = 2025-05-07T20:33:28.8678669Z 2025-05-07T20:33:28.8678841Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.8679412Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.8679894Z module_map=module_map) 2025-05-07T20:33:28.8680272Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.8680665Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.8680926Z E ^ 2025-05-07T20:33:28.8681402Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.8681871Z 2025-05-07T20:33:28.8682313Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.8682857Z 2025-05-07T20:33:28.8682959Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.8683377Z self=, 2025-05-07T20:33:28.8683791Z T=4096, 2025-05-07T20:33:28.8683971Z D=7168, 2025-05-07T20:33:28.8684165Z scale_ub=None, 2025-05-07T20:33:28.8684388Z contiguous=False, 2025-05-07T20:33:28.8684608Z compiled=True, 2025-05-07T20:33:28.8684808Z ) 2025-05-07T20:33:28.8685180Z self = 2025-05-07T20:33:28.8685694Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:28.8685974Z 2025-05-07T20:33:28.8686054Z @given( 2025-05-07T20:33:28.8686285Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.8686605Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.8686908Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.8687244Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.8687579Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.8687865Z ) 2025-05-07T20:33:28.8688219Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.8688672Z def test_silu_mul_quant( 2025-05-07T20:33:28.8688917Z self, 2025-05-07T20:33:28.8689103Z T: int, 2025-05-07T20:33:28.8689298Z D: int, 2025-05-07T20:33:28.8689527Z scale_ub: Optional[float], 2025-05-07T20:33:28.8689798Z contiguous: bool, 2025-05-07T20:33:28.8690044Z compiled: bool, 2025-05-07T20:33:28.8690265Z ) -> None: 2025-05-07T20:33:28.8690473Z torch.manual_seed(2025) 2025-05-07T20:33:28.8690718Z 2025-05-07T20:33:28.8690991Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.8691337Z 2025-05-07T20:33:28.8691527Z x_sign = torch.sign(x) 2025-05-07T20:33:28.8691817Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.8692128Z x = x_sign * x_clamp 2025-05-07T20:33:28.8692368Z x0 = x[:, :D] 2025-05-07T20:33:28.8692584Z x1 = x[:, D:] 2025-05-07T20:33:28.8692790Z 2025-05-07T20:33:28.8692975Z if contiguous: 2025-05-07T20:33:28.8693212Z x0 = x0.contiguous() 2025-05-07T20:33:28.8693468Z x1 = x1.contiguous() 2025-05-07T20:33:28.8693737Z 2025-05-07T20:33:28.8693954Z if scale_ub is not None: 2025-05-07T20:33:28.8694233Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.8694643Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.8694960Z ) 2025-05-07T20:33:28.8695149Z else: 2025-05-07T20:33:28.8695352Z scale_ub_tensor = None 2025-05-07T20:33:28.8695604Z 2025-05-07T20:33:28.8695833Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.8696144Z op = silu_mul_quant 2025-05-07T20:33:28.8696394Z if compiled: 2025-05-07T20:33:28.8696692Z op = torch.compile(op) 2025-05-07T20:33:28.8696992Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.8697270Z 2025-05-07T20:33:28.8697454Z > y_fp8, y_scale = fn() 2025-05-07T20:33:28.8697621Z 2025-05-07T20:33:28.8697719Z moe/activation_test.py:117: 2025-05-07T20:33:28.8698064Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.8698399Z moe/activation_test.py:115: in fn 2025-05-07T20:33:28.8698682Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.8699293Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:28.8699875Z return fn(*args, **kwargs) 
2025-05-07T20:33:28.8700557Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:28.8701268Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:28.8701833Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.8702553Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.8703252Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.8703843Z kernel = self.compile( 2025-05-07T20:33:28.8704464Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.8705155Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.8705562Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.8705799Z 2025-05-07T20:33:28.8706008Z self = 2025-05-07T20:33:28.8707130Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.8708564Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e4705440>} 2025-05-07T20:33:28.8709970Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.8711044Z context = 2025-05-07T20:33:28.8711346Z 2025-05-07T20:33:28.8711511Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.8712048Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.8712530Z module_map=module_map) 2025-05-07T20:33:28.8712903Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.8713264Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.8713529Z E ^ 2025-05-07T20:33:28.8714005Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.8714532Z 2025-05-07T20:33:28.8714968Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.8715516Z 2025-05-07T20:33:29.0314954Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:29.0316225Z self=, 2025-05-07T20:33:29.0317360Z T=16384, 2025-05-07T20:33:29.0317855Z D=5120, 2025-05-07T20:33:29.0318229Z scale_ub=1200.0, 2025-05-07T20:33:29.0318669Z contiguous=False, 2025-05-07T20:33:29.0319108Z compiled=False, 2025-05-07T20:33:29.0319490Z ) 2025-05-07T20:33:29.0320336Z self = 2025-05-07T20:33:29.0321357Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:29.0321934Z 2025-05-07T20:33:29.0322090Z @given( 2025-05-07T20:33:29.0322646Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:29.0323278Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:29.0323885Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:29.0324350Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:29.0324771Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:29.0325063Z ) 2025-05-07T20:33:29.0325593Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:29.0326057Z def test_silu_mul_quant( 2025-05-07T20:33:29.0326298Z self, 2025-05-07T20:33:29.0326488Z T: int, 2025-05-07T20:33:29.0326685Z D: int, 2025-05-07T20:33:29.0326907Z scale_ub: Optional[float], 2025-05-07T20:33:29.0327183Z contiguous: bool, 2025-05-07T20:33:29.0327421Z compiled: bool, 2025-05-07T20:33:29.0327646Z ) -> None: 2025-05-07T20:33:29.0327860Z torch.manual_seed(2025) 2025-05-07T20:33:29.0328095Z 2025-05-07T20:33:29.0328373Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:29.0328720Z 2025-05-07T20:33:29.0328906Z x_sign = torch.sign(x) 2025-05-07T20:33:29.0329262Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:29.0329579Z x = x_sign * x_clamp 2025-05-07T20:33:29.0329814Z x0 = x[:, :D] 2025-05-07T20:33:29.0330028Z x1 = x[:, D:] 2025-05-07T20:33:29.0330233Z 2025-05-07T20:33:29.0330417Z if contiguous: 2025-05-07T20:33:29.0330653Z x0 = x0.contiguous() 2025-05-07T20:33:29.0330915Z x1 = x1.contiguous() 2025-05-07T20:33:29.0331154Z 2025-05-07T20:33:29.0331344Z if scale_ub is not None: 2025-05-07T20:33:29.0331619Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:29.0331948Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:29.0332261Z ) 2025-05-07T20:33:29.0332449Z else: 2025-05-07T20:33:29.0332661Z scale_ub_tensor = None 2025-05-07T20:33:29.0332913Z 2025-05-07T20:33:29.0333149Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:29.0333463Z op = silu_mul_quant 2025-05-07T20:33:29.0333715Z if compiled: 2025-05-07T20:33:29.0333960Z op = torch.compile(op) 2025-05-07T20:33:29.0341433Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:29.0341747Z 2025-05-07T20:33:29.0341942Z > y_fp8, y_scale = fn() 2025-05-07T20:33:29.0342112Z 2025-05-07T20:33:29.0342217Z moe/activation_test.py:117: 2025-05-07T20:33:29.0342511Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:29.0342854Z moe/activation_test.py:115: in fn 2025-05-07T20:33:29.0343137Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:29.0343853Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:29.0344573Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:29.0345129Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:29.0345842Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:29.0346530Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:29.0347077Z kernel = self.compile( 2025-05-07T20:33:29.0347639Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:29.0348326Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:29.0348842Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:29.0349082Z 2025-05-07T20:33:29.0349290Z self = 2025-05-07T20:33:29.0350475Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:29.0351905Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e4706340>} 2025-05-07T20:33:29.0353374Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:29.0354450Z context = 2025-05-07T20:33:29.0354753Z 2025-05-07T20:33:29.0354925Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:29.0355460Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:29.0355948Z module_map=module_map) 2025-05-07T20:33:29.0356308Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:29.0356708Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:29.0356970Z E ^ 2025-05-07T20:33:29.0357440Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:29.0357917Z 2025-05-07T20:33:29.0358355Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:29.0358900Z 2025-05-07T20:33:29.0359002Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:29.0359428Z self=, 2025-05-07T20:33:29.0359840Z T=16384, 2025-05-07T20:33:29.0360033Z D=5120, 2025-05-07T20:33:29.0360221Z scale_ub=1200.0, 2025-05-07T20:33:29.0360435Z contiguous=True, 2025-05-07T20:33:29.0360651Z compiled=True, 2025-05-07T20:33:29.0360849Z ) 2025-05-07T20:33:29.0361161Z self = 2025-05-07T20:33:29.0361671Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:29.0361953Z 2025-05-07T20:33:29.0362041Z @given( 2025-05-07T20:33:29.0362262Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:29.0362575Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:29.0362882Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:29.0363208Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:29.0363536Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:29.0363830Z ) 2025-05-07T20:33:29.0364180Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:29.0364627Z def test_silu_mul_quant( 2025-05-07T20:33:29.0364866Z self, 2025-05-07T20:33:29.0365059Z T: int, 2025-05-07T20:33:29.0365248Z D: int, 2025-05-07T20:33:29.0365464Z scale_ub: Optional[float], 2025-05-07T20:33:29.0365733Z contiguous: bool, 2025-05-07T20:33:29.0365962Z compiled: bool, 2025-05-07T20:33:29.0366183Z ) -> None: 2025-05-07T20:33:29.0366389Z torch.manual_seed(2025) 2025-05-07T20:33:29.0366628Z 2025-05-07T20:33:29.0366898Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:29.0367250Z 2025-05-07T20:33:29.0367435Z x_sign = torch.sign(x) 2025-05-07T20:33:29.0367728Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:29.0368044Z x = x_sign * x_clamp 2025-05-07T20:33:29.0368281Z x0 = x[:, :D] 2025-05-07T20:33:29.0368543Z x1 = x[:, D:] 2025-05-07T20:33:29.0368749Z 2025-05-07T20:33:29.0368927Z if contiguous: 2025-05-07T20:33:29.0369149Z x0 = x0.contiguous() 2025-05-07T20:33:29.0369410Z x1 = x1.contiguous() 2025-05-07T20:33:29.0369653Z 2025-05-07T20:33:29.0369882Z if scale_ub is not None: 2025-05-07T20:33:29.0370157Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:29.0370499Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:29.0370802Z ) 2025-05-07T20:33:29.0371029Z else: 2025-05-07T20:33:29.0371238Z scale_ub_tensor = None 2025-05-07T20:33:29.0371489Z 2025-05-07T20:33:29.0371723Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:29.0372040Z op = silu_mul_quant 2025-05-07T20:33:29.0372284Z if compiled: 2025-05-07T20:33:29.0372527Z op = torch.compile(op) 2025-05-07T20:33:29.0372822Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:29.0373100Z 2025-05-07T20:33:29.0373284Z > y_fp8, y_scale = fn() 2025-05-07T20:33:29.0373453Z 2025-05-07T20:33:29.0373546Z moe/activation_test.py:117: 2025-05-07T20:33:29.0373846Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:29.0374187Z moe/activation_test.py:115: in fn 2025-05-07T20:33:29.0374611Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:29.0375230Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:29.0375812Z return fn(*args, **kwargs) 
2025-05-07T20:33:29.0376491Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:29.0377213Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:29.0377764Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:29.0378472Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:29.0379161Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:29.0379715Z kernel = self.compile( 2025-05-07T20:33:29.0380276Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:29.0380955Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:29.0381361Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:29.0381604Z 2025-05-07T20:33:29.0381814Z self = 2025-05-07T20:33:29.0382930Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:29.0384378Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e47079c0>} 2025-05-07T20:33:29.0385807Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:29.0386891Z context = 2025-05-07T20:33:29.0387187Z 2025-05-07T20:33:29.0387360Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:29.0387893Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:29.0388377Z module_map=module_map) 2025-05-07T20:33:29.0388743Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:29.0389148Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:29.0389408Z E ^ 2025-05-07T20:33:29.0389887Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:29.0390357Z 2025-05-07T20:33:29.0390836Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:29.0391377Z
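Every failure above is the same architecture mismatch: fp8e4nv is Triton's name for the float8_e4m3fn format, which NVIDIA GPUs only support natively from compute capability 8.9 (Ada/Hopper) onward, while the A10G on this g5 runner is SM 8.6 and therefore only exposes fp8e4b15 and fp8e5. A minimal sketch of a capability guard that would skip these examples instead of crashing inside the Triton compiler; the supports_fp8e4nv/requires_fp8e4nv names and the plain unittest decorator are illustrative assumptions, not FBGEMM's actual test plumbing:

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv (float8_e4m3fn) needs an NVIDIA GPU with compute
        # capability >= 8.9 (Ada/Hopper); an A10G reports (8, 6),
        # so this returns False on this runner.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical decorator, not FBGEMM's actual gating: skip the
    # test on GPUs where the kernel cannot compile.
    requires_fp8e4nv = unittest.skipUnless(
        supports_fp8e4nv(), "fp8e4nv requires compute capability >= 8.9"
    )

Applied as @requires_fp8e4nv on test_silu_mul_quant, the whole example set would be reported as skipped on this runner rather than as a string of identical CompilationErrors.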
2025-05-07T20:33:29.2111194Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:29.2111918Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:29.2112480Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:29.2113197Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:29.2113960Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:29.2114516Z kernel = self.compile( 2025-05-07T20:33:29.2115075Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:29.2115767Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:29.2116181Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:29.2116426Z 2025-05-07T20:33:29.2116641Z self = 2025-05-07T20:33:29.2117815Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:29.2119255Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e5484c20>} 2025-05-07T20:33:29.2120667Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:29.2121753Z context = 2025-05-07T20:33:29.2122059Z 2025-05-07T20:33:29.2122230Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:29.2122774Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:29.2123260Z module_map=module_map) 2025-05-07T20:33:29.2123633Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:29.2123997Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:29.2124267Z E ^ 2025-05-07T20:33:29.2124743Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:29.2125218Z 2025-05-07T20:33:29.2125843Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:29.2126393Z 2025-05-07T20:33:29.2126502Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:29.2126932Z self=, 2025-05-07T20:33:29.2127345Z T=2048, 2025-05-07T20:33:29.2127535Z D=5120, 2025-05-07T20:33:29.2127729Z scale_ub=None, 2025-05-07T20:33:29.2127943Z contiguous=False, 2025-05-07T20:33:29.2128174Z compiled=True, 2025-05-07T20:33:29.2128375Z ) 2025-05-07T20:33:29.3032305Z self = 2025-05-07T20:33:29.3033851Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:29.3034355Z 2025-05-07T20:33:29.3034464Z @given( 2025-05-07T20:33:29.3034745Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:29.3035068Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:29.3035377Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:29.3035715Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:29.3036153Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:29.3036438Z ) 2025-05-07T20:33:29.3036783Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:29.3037232Z def test_silu_mul_quant( 2025-05-07T20:33:29.3037472Z self, 2025-05-07T20:33:29.3037727Z T: int, 2025-05-07T20:33:29.3037933Z D: int, 2025-05-07T20:33:29.3038156Z scale_ub: Optional[float], 2025-05-07T20:33:29.3038427Z contiguous: bool, 2025-05-07T20:33:29.3038673Z compiled: bool, 2025-05-07T20:33:29.3038958Z ) -> None: 2025-05-07T20:33:29.3039166Z torch.manual_seed(2025) 2025-05-07T20:33:29.3039409Z 2025-05-07T20:33:29.3039680Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:29.3040022Z 2025-05-07T20:33:29.3040211Z x_sign = torch.sign(x) 2025-05-07T20:33:29.3040503Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:29.3040817Z x = x_sign * x_clamp 2025-05-07T20:33:29.3041047Z x0 = x[:, :D] 2025-05-07T20:33:29.3041255Z x1 = x[:, D:] 2025-05-07T20:33:29.3041460Z 2025-05-07T20:33:29.3041633Z if contiguous: 2025-05-07T20:33:29.3041863Z x0 = x0.contiguous() 2025-05-07T20:33:29.3042120Z x1 = x1.contiguous() 2025-05-07T20:33:29.3042359Z 2025-05-07T20:33:29.3042544Z if scale_ub is not None: 2025-05-07T20:33:29.3042813Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:29.3043210Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:29.3043528Z ) 2025-05-07T20:33:29.3043721Z else: 2025-05-07T20:33:29.3043924Z scale_ub_tensor = None 2025-05-07T20:33:29.3044177Z 2025-05-07T20:33:29.3044404Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:29.3044715Z op = silu_mul_quant 2025-05-07T20:33:29.3044968Z if compiled: 2025-05-07T20:33:29.3045215Z op = torch.compile(op) 2025-05-07T20:33:29.3045511Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:29.3045788Z 2025-05-07T20:33:29.3045972Z > y_fp8, y_scale = fn() 2025-05-07T20:33:29.3046136Z 2025-05-07T20:33:29.3046238Z moe/activation_test.py:117: 2025-05-07T20:33:29.3046534Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:29.3046876Z moe/activation_test.py:115: in fn 2025-05-07T20:33:29.3047156Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:29.3047730Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:29.3048315Z return fn(*args, **kwargs) 
2025-05-07T20:33:29.3048997Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:29.3049717Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:29.3050267Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:29.3050978Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:29.3051673Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:29.3052224Z kernel = self.compile( 2025-05-07T20:33:29.3052786Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:29.3053474Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:29.3053883Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:29.3054121Z 2025-05-07T20:33:29.3054331Z self = 2025-05-07T20:33:29.3055571Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:29.3057055Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e54859e0>} 2025-05-07T20:33:29.3058505Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:29.3059595Z context = 2025-05-07T20:33:29.3059933Z 2025-05-07T20:33:29.3060103Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:29.3060644Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:29.3061125Z module_map=module_map) 2025-05-07T20:33:29.3061496Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:29.3061859Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:29.3062120Z E ^ 2025-05-07T20:33:29.3062605Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:29.3063081Z 2025-05-07T20:33:29.3063520Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:29.3064116Z 2025-05-07T20:33:29.3064262Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:29.3064691Z self=, 2025-05-07T20:33:29.3065103Z T=2048, 2025-05-07T20:33:29.3065292Z D=5120, 2025-05-07T20:33:29.3065482Z scale_ub=1200.0, 2025-05-07T20:33:29.3065700Z contiguous=False, 2025-05-07T20:33:29.3065920Z compiled=True, 2025-05-07T20:33:29.3066114Z ) 2025-05-07T20:33:29.3066430Z self = 2025-05-07T20:33:29.3066936Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:29.3067220Z 2025-05-07T20:33:29.3067295Z @given( 2025-05-07T20:33:29.3067517Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:29.3067830Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:29.3068140Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:29.3068474Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:29.3068801Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:29.3069094Z ) 2025-05-07T20:33:29.3069441Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:29.3069891Z def test_silu_mul_quant( 2025-05-07T20:33:29.3070121Z self, 2025-05-07T20:33:29.3070312Z T: int, 2025-05-07T20:33:29.3070506Z D: int, 2025-05-07T20:33:29.3070716Z scale_ub: Optional[float], 2025-05-07T20:33:29.3070989Z contiguous: bool, 2025-05-07T20:33:29.3071223Z compiled: bool, 2025-05-07T20:33:29.3071434Z ) -> None: 2025-05-07T20:33:29.3071645Z torch.manual_seed(2025) 2025-05-07T20:33:29.3071885Z 2025-05-07T20:33:29.3072156Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:29.3072509Z 2025-05-07T20:33:29.3072699Z x_sign = torch.sign(x) 2025-05-07T20:33:29.3072981Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:29.3073301Z x = x_sign * x_clamp 2025-05-07T20:33:29.3073541Z x0 = x[:, :D] 2025-05-07T20:33:29.3073747Z x1 = x[:, D:] 2025-05-07T20:33:29.3073950Z 2025-05-07T20:33:29.3074127Z if contiguous: 2025-05-07T20:33:29.3074355Z x0 = x0.contiguous() 2025-05-07T20:33:29.3074608Z x1 = x1.contiguous() 2025-05-07T20:33:29.3074852Z 2025-05-07T20:33:29.3075039Z if scale_ub is not None: 2025-05-07T20:33:29.3075305Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:29.3075698Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:29.3076015Z ) 2025-05-07T20:33:29.3076201Z else: 2025-05-07T20:33:29.3076405Z scale_ub_tensor = None 2025-05-07T20:33:29.3076659Z 2025-05-07T20:33:29.3076922Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:29.3077247Z op = silu_mul_quant 2025-05-07T20:33:29.3077506Z if compiled: 2025-05-07T20:33:29.3077755Z op = torch.compile(op) 2025-05-07T20:33:29.3078099Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:29.3078383Z 2025-05-07T20:33:29.3078576Z > y_fp8, y_scale = fn() 2025-05-07T20:33:29.3078747Z 2025-05-07T20:33:29.3078849Z moe/activation_test.py:117: 2025-05-07T20:33:29.3079148Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:29.3079495Z moe/activation_test.py:115: in fn 2025-05-07T20:33:29.3079781Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:29.3080366Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:29.3080956Z return fn(*args, **kwargs) 
2025-05-07T20:33:29.3081646Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:29.3082374Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:29.3082981Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:29.3083706Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:29.3084400Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:29.3084963Z kernel = self.compile( 2025-05-07T20:33:29.3085530Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:29.3086222Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:29.3086635Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:29.3086877Z 2025-05-07T20:33:29.3087089Z self = 2025-05-07T20:33:29.3088212Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:29.3089650Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e5486b60>} 2025-05-07T20:33:29.3091052Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:29.3092143Z context = 2025-05-07T20:33:29.3092451Z 2025-05-07T20:33:29.3092621Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:29.3093165Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:29.3093645Z module_map=module_map) 2025-05-07T20:33:29.3094068Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:29.3094503Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:29.3094763Z E ^ 2025-05-07T20:33:29.3095242Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:29.3095723Z 2025-05-07T20:33:29.3096162Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:29.3096807Z 2025-05-07T20:33:29.4845870Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:29.4846529Z self=, 2025-05-07T20:33:29.4847096Z T=4096, 2025-05-07T20:33:29.4847351Z D=5120, 2025-05-07T20:33:29.4847757Z scale_ub=1200.0, 2025-05-07T20:33:29.4847977Z contiguous=True, 2025-05-07T20:33:29.4854833Z compiled=True, 2025-05-07T20:33:29.4855072Z ) 2025-05-07T20:33:29.4855419Z self = 2025-05-07T20:33:29.4856060Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:29.4856346Z 2025-05-07T20:33:29.4856433Z @given( 2025-05-07T20:33:29.4856678Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:29.4856997Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:29.4857304Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:29.4857636Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:29.4857967Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:29.4858266Z ) 2025-05-07T20:33:29.4858614Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:29.4859068Z def test_silu_mul_quant( 2025-05-07T20:33:29.4859311Z self, 2025-05-07T20:33:29.4859504Z T: int, 2025-05-07T20:33:29.4859700Z D: int, 2025-05-07T20:33:29.4859986Z scale_ub: Optional[float], 2025-05-07T20:33:29.4860258Z contiguous: bool, 2025-05-07T20:33:29.4860504Z compiled: bool, 2025-05-07T20:33:29.4860728Z ) -> None: 2025-05-07T20:33:29.4860945Z torch.manual_seed(2025) 2025-05-07T20:33:29.4861200Z 2025-05-07T20:33:29.4861484Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:29.4861848Z 2025-05-07T20:33:29.4862050Z x_sign = torch.sign(x) 2025-05-07T20:33:29.4862348Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:29.4862676Z x = x_sign * x_clamp 2025-05-07T20:33:29.4862920Z x0 = x[:, :D] 2025-05-07T20:33:29.4863143Z x1 = x[:, D:] 2025-05-07T20:33:29.4863362Z 2025-05-07T20:33:29.4863554Z if contiguous: 2025-05-07T20:33:29.4863797Z x0 = x0.contiguous() 2025-05-07T20:33:29.4864069Z x1 = x1.contiguous() 2025-05-07T20:33:29.4864314Z 2025-05-07T20:33:29.4864514Z if scale_ub is not None: 2025-05-07T20:33:29.4864804Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:29.4865144Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:29.4865463Z ) 2025-05-07T20:33:29.4865664Z else: 2025-05-07T20:33:29.4865902Z scale_ub_tensor = None 2025-05-07T20:33:29.4866163Z 2025-05-07T20:33:29.4866401Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:29.4866716Z op = silu_mul_quant 2025-05-07T20:33:29.4866962Z if compiled: 2025-05-07T20:33:29.4867204Z op = torch.compile(op) 2025-05-07T20:33:29.4867501Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:29.4867774Z 2025-05-07T20:33:29.4867962Z > y_fp8, y_scale = fn() 2025-05-07T20:33:29.4868124Z 2025-05-07T20:33:29.4868227Z moe/activation_test.py:117: 2025-05-07T20:33:29.4868522Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:29.4868853Z moe/activation_test.py:115: in fn 2025-05-07T20:33:29.4869136Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:29.4869715Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:29.4870292Z return fn(*args, **kwargs) 
2025-05-07T20:33:29.4870971Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:29.4871688Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:29.4872309Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:29.4873012Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:29.4873736Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:29.4874295Z kernel = self.compile( 2025-05-07T20:33:29.4874851Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:29.4875577Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:29.4875985Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:29.4876221Z 2025-05-07T20:33:29.4876437Z self = 2025-05-07T20:33:29.4877557Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:29.4878989Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e4590180>} 2025-05-07T20:33:29.4880432Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:29.4881522Z context = 2025-05-07T20:33:29.4881824Z 2025-05-07T20:33:29.4882004Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:29.4882543Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:29.4883028Z module_map=module_map) 2025-05-07T20:33:29.4883403Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:29.4883763Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:29.4884029Z E ^ 2025-05-07T20:33:29.4884549Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:29.4885028Z 2025-05-07T20:33:29.4885471Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:29.4886016Z 2025-05-07T20:33:29.4886124Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:29.4886555Z self=, 2025-05-07T20:33:29.4886978Z T=128, 2025-05-07T20:33:29.4887167Z D=5120, 2025-05-07T20:33:29.4887362Z scale_ub=1200.0, 2025-05-07T20:33:29.4887588Z contiguous=False, 2025-05-07T20:33:29.4887813Z compiled=True, 2025-05-07T20:33:29.4888021Z ) 2025-05-07T20:33:29.7737263Z self = 2025-05-07T20:33:29.7737827Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:29.7738154Z 2025-05-07T20:33:29.7738265Z @given( 2025-05-07T20:33:29.7738600Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:29.7739025Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:29.7739422Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:29.7739833Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:29.7740174Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:29.7740458Z ) 2025-05-07T20:33:29.7740810Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:29.7741265Z def test_silu_mul_quant( 2025-05-07T20:33:29.7741542Z self, 2025-05-07T20:33:29.7741739Z T: int, 2025-05-07T20:33:29.7741934Z D: int, 2025-05-07T20:33:29.7742319Z scale_ub: Optional[float], 2025-05-07T20:33:29.7742599Z contiguous: bool, 2025-05-07T20:33:29.7742838Z compiled: bool, 2025-05-07T20:33:29.7743064Z ) -> None: 2025-05-07T20:33:29.7743285Z torch.manual_seed(2025) 2025-05-07T20:33:29.7743528Z 2025-05-07T20:33:29.7743883Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:29.7744243Z 2025-05-07T20:33:29.7744439Z x_sign = torch.sign(x) 2025-05-07T20:33:29.7744733Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:29.7745121Z x = x_sign * x_clamp 2025-05-07T20:33:29.7745363Z x0 = x[:, :D] 2025-05-07T20:33:29.7745573Z x1 = x[:, D:] 2025-05-07T20:33:29.7745788Z 2025-05-07T20:33:29.7745978Z if contiguous: 2025-05-07T20:33:29.7746209Z x0 = x0.contiguous() 2025-05-07T20:33:29.7746470Z x1 = x1.contiguous() 2025-05-07T20:33:29.7746719Z 2025-05-07T20:33:29.7746906Z if scale_ub is not None: 2025-05-07T20:33:29.7747190Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:29.7747538Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:29.7747848Z ) 2025-05-07T20:33:29.7748044Z else: 2025-05-07T20:33:29.7748267Z scale_ub_tensor = None 2025-05-07T20:33:29.7748518Z 2025-05-07T20:33:29.7748752Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:29.7749080Z op = silu_mul_quant 2025-05-07T20:33:29.7749423Z if compiled: 2025-05-07T20:33:29.7749677Z op = torch.compile(op) 2025-05-07T20:33:29.7749981Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:29.7750268Z 2025-05-07T20:33:29.7750455Z > y_fp8, y_scale = fn() 2025-05-07T20:33:29.7750630Z 2025-05-07T20:33:29.7750729Z moe/activation_test.py:117: 2025-05-07T20:33:29.7751039Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:29.7751379Z moe/activation_test.py:115: in fn 2025-05-07T20:33:29.7751671Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:29.7752259Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:29.7752851Z return fn(*args, **kwargs) 
2025-05-07T20:33:29.7753538Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:29.7754327Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:29.7754888Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:29.7755603Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:29.7756300Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:29.7756860Z kernel = self.compile( 2025-05-07T20:33:29.7757431Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:29.7758119Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:29.7758530Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:29.7758776Z 2025-05-07T20:33:29.7758994Z self = 2025-05-07T20:33:29.7760130Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:29.7761565Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e4590ea0>} 2025-05-07T20:33:29.7762978Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:29.7764125Z context = 2025-05-07T20:33:29.7764430Z 2025-05-07T20:33:29.7764681Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:29.7765229Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:29.7765726Z module_map=module_map) 2025-05-07T20:33:29.7766156Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:29.7766536Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:29.7766802Z E ^ 2025-05-07T20:33:29.7767284Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:29.7767764Z 2025-05-07T20:33:29.7768215Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:29.7768768Z 2025-05-07T20:33:29.7768878Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:29.7769313Z self=, 2025-05-07T20:33:29.7769743Z T=16384, 2025-05-07T20:33:29.7769955Z D=7168, 2025-05-07T20:33:29.7770154Z scale_ub=1200.0, 2025-05-07T20:33:29.7770389Z contiguous=True, 2025-05-07T20:33:29.7770619Z compiled=True, 2025-05-07T20:33:29.7770875Z ) 2025-05-07T20:33:29.7771222Z self = 2025-05-07T20:33:29.7771750Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:29.7772039Z 2025-05-07T20:33:29.7772119Z @given( 2025-05-07T20:33:29.7772356Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:29.7772689Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:29.7773016Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:29.7773356Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:29.7773700Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:29.7774009Z ) 2025-05-07T20:33:29.7774375Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:29.7774969Z def test_silu_mul_quant( 2025-05-07T20:33:29.7775224Z self, 2025-05-07T20:33:29.7775417Z T: int, 2025-05-07T20:33:29.7775619Z D: int, 2025-05-07T20:33:29.7775843Z scale_ub: Optional[float], 2025-05-07T20:33:29.7776129Z contiguous: bool, 2025-05-07T20:33:29.7776374Z compiled: bool, 2025-05-07T20:33:29.7776606Z ) -> None: 2025-05-07T20:33:29.7776825Z torch.manual_seed(2025) 2025-05-07T20:33:29.7777061Z 2025-05-07T20:33:29.7777347Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:29.7777706Z 2025-05-07T20:33:29.7777896Z x_sign = torch.sign(x) 2025-05-07T20:33:29.7778190Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:29.7778509Z x = x_sign * x_clamp 2025-05-07T20:33:29.7778749Z x0 = x[:, :D] 2025-05-07T20:33:29.7778964Z x1 = x[:, D:] 2025-05-07T20:33:29.7779173Z 2025-05-07T20:33:29.7779354Z if contiguous: 2025-05-07T20:33:29.7779586Z x0 = x0.contiguous() 2025-05-07T20:33:29.7779848Z x1 = x1.contiguous() 2025-05-07T20:33:29.7780092Z 2025-05-07T20:33:29.7780280Z if scale_ub is not None: 2025-05-07T20:33:29.7780559Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:29.7780898Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:29.7781206Z ) 2025-05-07T20:33:29.7781401Z else: 2025-05-07T20:33:29.7781615Z scale_ub_tensor = None 2025-05-07T20:33:29.7781867Z 2025-05-07T20:33:29.7782103Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:29.7782479Z op = silu_mul_quant 2025-05-07T20:33:29.7782731Z if compiled: 2025-05-07T20:33:29.7782982Z op = torch.compile(op) 2025-05-07T20:33:29.7783282Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:29.7783557Z 2025-05-07T20:33:29.7783750Z > y_fp8, y_scale = fn() 2025-05-07T20:33:29.7783969Z 2025-05-07T20:33:29.7784087Z moe/activation_test.py:117: 2025-05-07T20:33:29.7784416Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:29.7784751Z moe/activation_test.py:115: in fn 2025-05-07T20:33:29.7785084Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:29.7785666Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:29.7786251Z return fn(*args, **kwargs) 
2025-05-07T20:33:29.7786948Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:29.7787679Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:29.7788242Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:29.7788954Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:29.7789659Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:29.7790223Z kernel = self.compile( 2025-05-07T20:33:29.7790838Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:29.7791542Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:29.7791964Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:29.7792213Z 2025-05-07T20:33:29.7792438Z self = 2025-05-07T20:33:29.7793571Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:29.7795004Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e45920c0>} 2025-05-07T20:33:29.7796426Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:29.7797524Z context = 2025-05-07T20:33:29.7797828Z 2025-05-07T20:33:29.7798009Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:29.7798555Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:29.7799051Z module_map=module_map) 2025-05-07T20:33:29.7799435Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:29.7799798Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:29.7800072Z E ^ 2025-05-07T20:33:29.7800563Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:29.7801038Z 2025-05-07T20:33:29.7801488Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:29.7802039Z 2025-05-07T20:33:29.9036859Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:29.9037356Z self=, 2025-05-07T20:33:29.9037934Z T=16384, 2025-05-07T20:33:29.9038133Z D=5120, 2025-05-07T20:33:29.9038322Z scale_ub=1200.0, 2025-05-07T20:33:29.9038542Z contiguous=True, 2025-05-07T20:33:29.9038885Z compiled=False, 2025-05-07T20:33:29.9039080Z ) 2025-05-07T20:33:29.9039404Z self = 2025-05-07T20:33:29.9039923Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:29.9040211Z 2025-05-07T20:33:29.9040352Z @given( 2025-05-07T20:33:29.9040583Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:29.9040900Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:29.9041214Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:29.9041605Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:29.9041936Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:29.9042232Z ) 2025-05-07T20:33:29.9042580Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:29.9043036Z def test_silu_mul_quant( 2025-05-07T20:33:29.9043285Z self, 2025-05-07T20:33:29.9043472Z T: int, 2025-05-07T20:33:29.9043668Z D: int, 2025-05-07T20:33:29.9043896Z scale_ub: Optional[float], 2025-05-07T20:33:29.9044174Z contiguous: bool, 2025-05-07T20:33:29.9044423Z compiled: bool, 2025-05-07T20:33:29.9044652Z ) -> None: 2025-05-07T20:33:29.9044862Z torch.manual_seed(2025) 2025-05-07T20:33:29.9045105Z 2025-05-07T20:33:29.9045379Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:29.9045726Z 2025-05-07T20:33:29.9046020Z x_sign = torch.sign(x) 2025-05-07T20:33:29.9046310Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:29.9046625Z x = x_sign * x_clamp 2025-05-07T20:33:29.9046862Z x0 = x[:, :D] 2025-05-07T20:33:29.9047071Z x1 = x[:, D:] 2025-05-07T20:33:29.9047281Z 2025-05-07T20:33:29.9047468Z if contiguous: 2025-05-07T20:33:29.9047696Z x0 = x0.contiguous() 2025-05-07T20:33:29.9047962Z x1 = x1.contiguous() 2025-05-07T20:33:29.9048204Z 2025-05-07T20:33:29.9048397Z if scale_ub is not None: 2025-05-07T20:33:29.9048669Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:29.9049007Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:29.9049320Z ) 2025-05-07T20:33:29.9049508Z else: 2025-05-07T20:33:29.9049716Z scale_ub_tensor = None 2025-05-07T20:33:29.9049968Z 2025-05-07T20:33:29.9050194Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:29.9050517Z op = silu_mul_quant 2025-05-07T20:33:29.9050772Z if compiled: 2025-05-07T20:33:29.9051013Z op = torch.compile(op) 2025-05-07T20:33:29.9051313Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:29.9051594Z 2025-05-07T20:33:29.9051781Z > y_fp8, y_scale = fn() 2025-05-07T20:33:29.9051952Z 2025-05-07T20:33:29.9052048Z moe/activation_test.py:117: 2025-05-07T20:33:29.9052346Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:29.9052686Z moe/activation_test.py:115: in fn 2025-05-07T20:33:29.9052963Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:29.9053691Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:29.9054508Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:29.9055070Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:29.9055797Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:29.9056490Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:29.9057047Z kernel = self.compile( 2025-05-07T20:33:29.9057605Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:29.9058350Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:29.9058768Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:29.9059006Z 2025-05-07T20:33:29.9059232Z self = 2025-05-07T20:33:29.9060395Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:29.9061867Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e4591a80>} 2025-05-07T20:33:29.9063276Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:29.9064373Z context = 2025-05-07T20:33:29.9064672Z 2025-05-07T20:33:29.9064842Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:29.9065394Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:29.9065884Z module_map=module_map) 2025-05-07T20:33:29.9066312Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:29.9066674Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:29.9066952Z E ^ 2025-05-07T20:33:29.9067442Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:29.9067914Z 2025-05-07T20:33:29.9068354Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:29.9068907Z 2025-05-07T20:33:29.9069018Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:29.9069448Z self=, 2025-05-07T20:33:29.9069871Z T=1, 2025-05-07T20:33:29.9070063Z D=7168, 2025-05-07T20:33:29.9070254Z scale_ub=1200.0, 2025-05-07T20:33:29.9070481Z contiguous=False, 2025-05-07T20:33:29.9070701Z compiled=False, 2025-05-07T20:33:29.9070908Z ) 2025-05-07T20:33:29.9071235Z self = 2025-05-07T20:33:29.9071732Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:29.9072016Z 2025-05-07T20:33:29.9072090Z @given( 2025-05-07T20:33:29.9072316Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:29.9072631Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:29.9072938Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:29.9073272Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:29.9073608Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:29.9073891Z ) 2025-05-07T20:33:29.9074244Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:29.9074699Z def test_silu_mul_quant( 2025-05-07T20:33:29.9074938Z self, 2025-05-07T20:33:29.9075135Z T: int, 2025-05-07T20:33:29.9075331Z D: int, 2025-05-07T20:33:29.9075543Z scale_ub: Optional[float], 2025-05-07T20:33:29.9075823Z contiguous: bool, 2025-05-07T20:33:29.9076066Z compiled: bool, 2025-05-07T20:33:29.9076282Z ) -> None: 2025-05-07T20:33:29.9076496Z torch.manual_seed(2025) 2025-05-07T20:33:29.9076746Z 2025-05-07T20:33:29.9077016Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:29.9077376Z 2025-05-07T20:33:29.9077570Z x_sign = torch.sign(x) 2025-05-07T20:33:29.9077865Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:29.9079757Z x = x_sign * x_clamp 2025-05-07T20:33:29.9079999Z x0 = x[:, :D] 2025-05-07T20:33:29.9080218Z x1 = x[:, D:] 2025-05-07T20:33:29.9080417Z 2025-05-07T20:33:29.9080606Z if contiguous: 2025-05-07T20:33:29.9080842Z x0 = x0.contiguous() 2025-05-07T20:33:29.9081140Z x1 = x1.contiguous() 2025-05-07T20:33:29.9081382Z 2025-05-07T20:33:29.9081575Z if scale_ub is not None: 2025-05-07T20:33:29.9081847Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:29.9082193Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:29.9082549Z ) 2025-05-07T20:33:29.9090374Z else: 2025-05-07T20:33:29.9090611Z scale_ub_tensor = None 2025-05-07T20:33:29.9090884Z 2025-05-07T20:33:29.9091128Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:29.9091469Z op = silu_mul_quant 2025-05-07T20:33:29.9091733Z if compiled: 2025-05-07T20:33:29.9091993Z op = torch.compile(op) 2025-05-07T20:33:29.9092301Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:29.9092590Z 2025-05-07T20:33:29.9092784Z > y_fp8, y_scale = fn() 2025-05-07T20:33:29.9092959Z 2025-05-07T20:33:29.9093063Z moe/activation_test.py:117: 2025-05-07T20:33:29.9093376Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:29.9093722Z moe/activation_test.py:115: in fn 2025-05-07T20:33:29.9094096Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:29.9094902Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:29.9095646Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:29.9096208Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:29.9096931Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:29.9097634Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:29.9098200Z kernel = self.compile( 2025-05-07T20:33:29.9098761Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:29.9099459Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:29.9099879Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:29.9100117Z 2025-05-07T20:33:29.9100335Z self = 2025-05-07T20:33:29.9101470Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:29.9102903Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e41ac0e0>} 2025-05-07T20:33:29.9104314Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:29.9105403Z context = 2025-05-07T20:33:29.9105704Z 2025-05-07T20:33:29.9105876Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:29.9106416Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:29.9106901Z module_map=module_map) 2025-05-07T20:33:29.9107279Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:29.9107639Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:29.9107915Z E ^ 2025-05-07T20:33:29.9108456Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:29.9108931Z 2025-05-07T20:33:29.9109370Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:29.9109924Z 2025-05-07T20:33:30.0841021Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:30.0841496Z self=, 2025-05-07T20:33:30.0841949Z T=4096, 2025-05-07T20:33:30.0842641Z D=7168, 2025-05-07T20:33:30.0843507Z scale_ub=1200.0, 2025-05-07T20:33:30.0844033Z contiguous=False, 2025-05-07T20:33:30.0844357Z compiled=True, 2025-05-07T20:33:30.0844578Z ) 2025-05-07T20:33:30.0844911Z self = 2025-05-07T20:33:30.0845443Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:30.0845734Z 2025-05-07T20:33:30.0845828Z @given( 2025-05-07T20:33:30.0846063Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:30.0846381Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:30.0846695Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:30.0847039Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:30.0847371Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:30.0847672Z ) 2025-05-07T20:33:30.0848105Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:30.0848566Z def test_silu_mul_quant( 2025-05-07T20:33:30.0848812Z self, 2025-05-07T20:33:30.0849011Z T: int, 2025-05-07T20:33:30.0849202Z D: int, 2025-05-07T20:33:30.0849426Z scale_ub: Optional[float], 2025-05-07T20:33:30.0849702Z contiguous: bool, 2025-05-07T20:33:30.0849939Z compiled: bool, 2025-05-07T20:33:30.0850169Z ) -> None: 2025-05-07T20:33:30.0850395Z torch.manual_seed(2025) 2025-05-07T20:33:30.0850649Z 2025-05-07T20:33:30.0850924Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:30.0851278Z 2025-05-07T20:33:30.0851475Z x_sign = torch.sign(x) 2025-05-07T20:33:30.0851770Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:30.0852093Z x = x_sign * x_clamp 2025-05-07T20:33:30.0852342Z x0 = x[:, :D] 2025-05-07T20:33:30.0852554Z x1 = x[:, D:] 2025-05-07T20:33:30.0852765Z 2025-05-07T20:33:30.0852966Z if contiguous: 2025-05-07T20:33:30.0853195Z x0 = x0.contiguous() 2025-05-07T20:33:30.0853461Z x1 = x1.contiguous() 2025-05-07T20:33:30.0853708Z 2025-05-07T20:33:30.0853895Z if scale_ub is not None: 2025-05-07T20:33:30.0854219Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:30.0854649Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:30.0854962Z ) 2025-05-07T20:33:30.0855163Z else: 2025-05-07T20:33:30.0855379Z scale_ub_tensor = None 2025-05-07T20:33:30.0855638Z 2025-05-07T20:33:30.0855876Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:30.0856192Z op = silu_mul_quant 2025-05-07T20:33:30.0856451Z if compiled: 2025-05-07T20:33:30.0856698Z op = torch.compile(op) 2025-05-07T20:33:30.0857007Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:30.0857293Z 2025-05-07T20:33:30.0857485Z > y_fp8, y_scale = fn() 2025-05-07T20:33:30.0857672Z 2025-05-07T20:33:30.0857776Z moe/activation_test.py:117: 2025-05-07T20:33:30.0858077Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:30.0858426Z moe/activation_test.py:115: in fn 2025-05-07T20:33:30.0858709Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:30.0859294Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:30.0859959Z return fn(*args, **kwargs) 
2025-05-07T20:33:30.0860641Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:30.0861369Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:30.0861978Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:30.0862702Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:30.0863394Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:30.0863996Z kernel = self.compile( 2025-05-07T20:33:30.0864563Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:30.0865250Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:30.0865668Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:30.0865913Z 2025-05-07T20:33:30.0866126Z self = 2025-05-07T20:33:30.0867262Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:30.0868747Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e41ad300>} 2025-05-07T20:33:30.0870162Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:30.0871258Z context = 2025-05-07T20:33:30.0871568Z 2025-05-07T20:33:30.0871739Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:30.0872282Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:30.0872763Z module_map=module_map) 2025-05-07T20:33:30.0873144Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:30.0873511Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:30.0873778Z E ^ 2025-05-07T20:33:30.0874272Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:30.0874751Z 2025-05-07T20:33:30.0875190Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:30.0875734Z 2025-05-07T20:33:30.0875846Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:30.0876274Z self=, 2025-05-07T20:33:30.0876694Z T=128, 2025-05-07T20:33:30.0877065Z D=7168, 2025-05-07T20:33:30.0877263Z scale_ub=1200.0, 2025-05-07T20:33:30.0877497Z contiguous=False, 2025-05-07T20:33:30.0877731Z compiled=True, 2025-05-07T20:33:30.0877935Z ) 2025-05-07T20:33:30.1790054Z self = 2025-05-07T20:33:30.1790902Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:30.1791315Z 2025-05-07T20:33:30.1791442Z @given( 2025-05-07T20:33:30.1791774Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:30.1792190Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:30.1792508Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:30.1792849Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:30.1793179Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:30.1793474Z ) 2025-05-07T20:33:30.1793948Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:30.1794402Z def test_silu_mul_quant( 2025-05-07T20:33:30.1794646Z self, 2025-05-07T20:33:30.1794847Z T: int, 2025-05-07T20:33:30.1795049Z D: int, 2025-05-07T20:33:30.1795336Z scale_ub: Optional[float], 2025-05-07T20:33:30.1795616Z contiguous: bool, 2025-05-07T20:33:30.1795857Z compiled: bool, 2025-05-07T20:33:30.1796088Z ) -> None: 2025-05-07T20:33:30.1796309Z torch.manual_seed(2025) 2025-05-07T20:33:30.1796547Z 2025-05-07T20:33:30.1796893Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:30.1797255Z 2025-05-07T20:33:30.1797458Z x_sign = torch.sign(x) 2025-05-07T20:33:30.1797753Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:30.1798080Z x = x_sign * x_clamp 2025-05-07T20:33:30.1798326Z x0 = x[:, :D] 2025-05-07T20:33:30.1798543Z x1 = x[:, D:] 2025-05-07T20:33:30.1798766Z 2025-05-07T20:33:30.1798962Z if contiguous: 2025-05-07T20:33:30.1799198Z x0 = x0.contiguous() 2025-05-07T20:33:30.1799468Z x1 = x1.contiguous() 2025-05-07T20:33:30.1799717Z 2025-05-07T20:33:30.1799912Z if scale_ub is not None: 2025-05-07T20:33:30.1800202Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:30.1800552Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:30.1800870Z ) 2025-05-07T20:33:30.1801139Z else: 2025-05-07T20:33:30.1801363Z scale_ub_tensor = None 2025-05-07T20:33:30.1801620Z 2025-05-07T20:33:30.1801864Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:30.1802191Z op = silu_mul_quant 2025-05-07T20:33:30.1802452Z if compiled: 2025-05-07T20:33:30.1802701Z op = torch.compile(op) 2025-05-07T20:33:30.1803017Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:30.1803306Z 2025-05-07T20:33:30.1803514Z > y_fp8, y_scale = fn() 2025-05-07T20:33:30.1803683Z 2025-05-07T20:33:30.1803786Z moe/activation_test.py:117: 2025-05-07T20:33:30.1804091Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:30.1804438Z moe/activation_test.py:115: in fn 2025-05-07T20:33:30.1804727Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:30.1805317Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:30.1805907Z return fn(*args, **kwargs) 
2025-05-07T20:33:30.1806603Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:30.1807326Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:30.1807885Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:30.1808603Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:30.1809300Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:30.1809861Z kernel = self.compile( 2025-05-07T20:33:30.1810429Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:30.1811119Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:30.1811529Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:30.1811778Z 2025-05-07T20:33:30.1811992Z self = 2025-05-07T20:33:30.1813115Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:30.1814796Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e41ae020>} 2025-05-07T20:33:30.1816262Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:30.1817356Z context = 2025-05-07T20:33:30.1817662Z 2025-05-07T20:33:30.1817877Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:30.1818430Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:30.1818919Z module_map=module_map) 2025-05-07T20:33:30.1819300Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:30.1819673Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:30.1819953Z E ^ 2025-05-07T20:33:30.1820440Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:30.1820920Z 2025-05-07T20:33:30.1821370Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:30.1821919Z 2025-05-07T20:33:30.1822028Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:30.1822571Z self=, 2025-05-07T20:33:30.1822994Z T=2048, 2025-05-07T20:33:30.1823194Z D=7168, 2025-05-07T20:33:30.1823394Z scale_ub=None, 2025-05-07T20:33:30.1823613Z contiguous=True, 2025-05-07T20:33:30.1823849Z compiled=True, 2025-05-07T20:33:30.1824067Z ) 2025-05-07T20:33:30.1824441Z self = 2025-05-07T20:33:30.1824956Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:30.1825237Z 2025-05-07T20:33:30.1825327Z @given( 2025-05-07T20:33:30.1825754Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:30.1826077Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:30.1826400Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:30.1826743Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:30.1827083Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:30.1827386Z ) 2025-05-07T20:33:30.1827755Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:30.1828218Z def test_silu_mul_quant( 2025-05-07T20:33:30.1828472Z self, 2025-05-07T20:33:30.1828677Z T: int, 2025-05-07T20:33:30.1828880Z D: int, 2025-05-07T20:33:30.1829103Z scale_ub: Optional[float], 2025-05-07T20:33:30.1829381Z contiguous: bool, 2025-05-07T20:33:30.1829620Z compiled: bool, 2025-05-07T20:33:30.1829853Z ) -> None: 2025-05-07T20:33:30.1830075Z torch.manual_seed(2025) 2025-05-07T20:33:30.1830320Z 2025-05-07T20:33:30.1830600Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:30.1830954Z 2025-05-07T20:33:30.1831153Z x_sign = torch.sign(x) 2025-05-07T20:33:30.1831446Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:30.1831762Z x = x_sign * x_clamp 2025-05-07T20:33:30.1832010Z x0 = x[:, :D] 2025-05-07T20:33:30.1832228Z x1 = x[:, D:] 2025-05-07T20:33:30.1832437Z 2025-05-07T20:33:30.1832625Z if contiguous: 2025-05-07T20:33:30.1832855Z x0 = x0.contiguous() 2025-05-07T20:33:30.1833118Z x1 = x1.contiguous() 2025-05-07T20:33:30.1833364Z 2025-05-07T20:33:30.1833555Z if scale_ub is not None: 2025-05-07T20:33:30.1833836Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:30.1834177Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:30.1834589Z ) 2025-05-07T20:33:30.1834785Z else: 2025-05-07T20:33:30.1835000Z scale_ub_tensor = None 2025-05-07T20:33:30.1835280Z 2025-05-07T20:33:30.1835507Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:30.1835831Z op = silu_mul_quant 2025-05-07T20:33:30.1836148Z if compiled: 2025-05-07T20:33:30.1836395Z op = torch.compile(op) 2025-05-07T20:33:30.1836705Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:30.1836989Z 2025-05-07T20:33:30.1837178Z > y_fp8, y_scale = fn() 2025-05-07T20:33:30.1837410Z 2025-05-07T20:33:30.1837510Z moe/activation_test.py:117: 2025-05-07T20:33:30.1837809Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:30.1838152Z moe/activation_test.py:115: in fn 2025-05-07T20:33:30.1838438Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:30.1839022Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:30.1839613Z return fn(*args, **kwargs) 
2025-05-07T20:33:30.2501968Z Trying example: test_silu_mul_quant( self=<…>, T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False, )
[test source listing elided here and for all following examples; each entry keeps the example's parameters, the failing statement, and the error]
2025-05-07T20:33:30.2516723Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:33:30.2518880Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:33:30.2521017Z moe/activation_test.py:95: OutOfMemoryError
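Note: the allocator hint in the message is only part of the story. expandable_segments:True lets the caching allocator grow segments instead of fragmenting fixed-size ones, which addresses the reserved-but-unallocated slice (45-141 MiB here); the dominant problem, though, is that the device is already ~21.6 GiB full when the example starts. The setting is read once, at the first CUDA allocation, so it must be in place before torch touches the GPU. A sketch, assuming conftest.py or the CI job environment as the placement:

    import os

    # The allocator reads this once, at the first CUDA allocation, so set it
    # before torch allocates on the GPU (e.g. conftest.py or the job env).
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch  # noqa: E402  -- imported after the env var on purpose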
2025-05-07T20:33:30.2521344Z Trying example: test_silu_mul_quant( self=<…>, T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True, )
2025-05-07T20:33:30.2530462Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:33:30.2532660Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. […]
2025-05-07T20:33:30.2534906Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:33:30.2535268Z Trying example: test_silu_mul_quant( self=<…>, T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False, )
2025-05-07T20:33:30.2550068Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:30.2552285Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. […]
2025-05-07T20:33:30.2554433Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:33:30.2554755Z Trying example: test_silu_mul_quant( self=<…>, T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True, )
2025-05-07T20:33:30.2563874Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:33:30.2566056Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. […]
2025-05-07T20:33:30.2568166Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:33:30.2568490Z Trying example: test_silu_mul_quant( self=<…>, T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False, )
2025-05-07T20:33:30.3703849Z >       x_sign = torch.sign(x)
2025-05-07T20:33:30.3705977Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. […]
2025-05-07T20:33:30.3708154Z moe/activation_test.py:94: OutOfMemoryError
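Note: the "allocated by PyTorch" figure only climbs across consecutive examples (21.50 GiB, then 21.67 GiB, later 21.73 GiB), even though each example allocates fresh tensors. Tensors from a failed example stay reachable until Hypothesis moves on, so each new example starts on a nearly full device and progressively earlier statements OOM: randn at activation_test.py:92, sign at :94, clamp at :95. A best-effort per-example cleanup sketch; the placement inside the test body is illustrative:

    import gc
    import torch

    def release_cuda_memory() -> None:
        gc.collect()              # drop Python references to dead tensors
        torch.cuda.empty_cache()  # return cached blocks to the driver
        torch.cuda.synchronize()

    # Hypothetical use at the end of each example's body:
    # try:
    #     ... example body ...
    # finally:
    #     release_cuda_memory()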
2025-05-07T20:33:30.3708490Z Trying example: test_silu_mul_quant( self=<…>, T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False, )
[CompilationError traceback identical to the first example: fp8e4nv unsupported on this architecture]
2025-05-07T20:33:30.3739366Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:30.3740026Z Trying example: test_silu_mul_quant( self=<…>, T=128, D=5120, scale_ub=None, contiguous=True, compiled=False, )
[identical CompilationError traceback]
2025-05-07T20:33:30.4443630Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:30.4444273Z Trying example: test_silu_mul_quant( self=<…>, T=128, D=7168, scale_ub=None, contiguous=True, compiled=False, )
[identical CompilationError traceback]
2025-05-07T20:33:30.4475186Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
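Note: even the tiny shapes here (T=1, T=128) fail identically, confirming the CompilationError is an architecture problem, not a size problem. For context on what _fbgemm_silu_mul_quant is compiling: judging from the call sites, silu_mul_quant(x0, x1, scale_ub) fuses a SwiGLU-style gate with dynamic FP8 quantization and returns the quantized activation plus its scale. A plausible eager-mode sketch of that contract, assuming row-wise scaling; this illustrates the semantics implied by the test, not FBGEMM's kernel:

    from typing import Optional, Tuple
    import torch

    FP8_MAX = float(torch.finfo(torch.float8_e4m3fn).max)  # 448.0 for e4m3fn

    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # SwiGLU-style gate, computed in fp32 for a stable quantization step.
        y = torch.nn.functional.silu(x0.float()) * x1.float()
        amax = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
        if scale_ub is not None:
            amax = torch.minimum(amax, scale_ub)  # cap the dynamic range
        scale = amax / FP8_MAX                    # per-row dequant scale
        y_fp8 = (y / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
        return y_fp8, scale.squeeze(-1)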
2025-05-07T20:33:30.4475846Z Trying example: test_silu_mul_quant( self=<…>, T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False, )
2025-05-07T20:33:30.5294607Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:30.5296839Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. […]
2025-05-07T20:33:30.5298954Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:33:30.5299269Z Trying example: test_silu_mul_quant( self=<…>, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False, )
[identical CompilationError traceback: fp8e4nv unsupported]
2025-05-07T20:33:30.5337661Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:30.5338311Z Trying example: test_silu_mul_quant( self=<…>, T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False, )
2025-05-07T20:33:30.5346937Z >       x_sign = torch.sign(x)
2025-05-07T20:33:30.5350399Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. […]
2025-05-07T20:33:30.5352509Z moe/activation_test.py:94: OutOfMemoryError
2025-05-07T20:33:30.5352830Z Trying example: test_silu_mul_quant( self=<…>, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False, )
2025-05-07T20:33:30.6108383Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:30.6110574Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. […]
2025-05-07T20:33:30.6112695Z moe/activation_test.py:92: OutOfMemoryError
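Note: every "Tried to allocate" figure in this run is exactly the footprint of one [T, 2*D] bfloat16 tensor, i.e. T * 2D * 2 bytes, which pins the failures to the test's own input buffers rather than to the kernel. A quick check of the sizes reported above:

    MIB = 1 << 20

    def activation_buffer_mib(T: int, D: int, itemsize: int = 2) -> float:
        # one [T, 2*D] bf16 tensor, as allocated by torch.randn in the test
        return T * 2 * D * itemsize / MIB

    assert activation_buffer_mib(16384, 5120) == 320.0  # the randn OOM above
    assert activation_buffer_mib(16384, 7168) == 448.0
    assert activation_buffer_mib(4096, 7168) == 112.0
    assert activation_buffer_mib(2048, 5120) == 40.0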
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:30.6112570Z 2025-05-07T20:33:30.6112695Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:30.6112912Z 2025-05-07T20:33:30.6113018Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:30.6113436Z self=, 2025-05-07T20:33:30.6113848Z T=4096, 2025-05-07T20:33:30.6114036Z D=5120, 2025-05-07T20:33:30.6114229Z scale_ub=None, 2025-05-07T20:33:30.6114447Z contiguous=True, 2025-05-07T20:33:30.6114663Z compiled=False, 2025-05-07T20:33:30.6114867Z ) 2025-05-07T20:33:30.6115191Z self = 2025-05-07T20:33:30.6115693Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:30.6115980Z 2025-05-07T20:33:30.6116059Z @given( 2025-05-07T20:33:30.6116288Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:30.6116602Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:30.6116903Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:30.6117235Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:30.6117578Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:30.6117863Z ) 2025-05-07T20:33:30.6118209Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:30.6118666Z def test_silu_mul_quant( 2025-05-07T20:33:30.6118907Z self, 2025-05-07T20:33:30.6119098Z T: int, 2025-05-07T20:33:30.6119289Z D: int, 2025-05-07T20:33:30.6119500Z scale_ub: Optional[float], 2025-05-07T20:33:30.6119777Z contiguous: bool, 2025-05-07T20:33:30.6120018Z compiled: bool, 2025-05-07T20:33:30.6120241Z ) -> None: 2025-05-07T20:33:30.6120444Z torch.manual_seed(2025) 2025-05-07T20:33:30.6120758Z 2025-05-07T20:33:30.6121034Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:30.6123238Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:30.6125294Z 2025-05-07T20:33:30.6125578Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:30.6125803Z 2025-05-07T20:33:30.6125907Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:30.6126325Z self=, 2025-05-07T20:33:30.6126746Z T=2048, 2025-05-07T20:33:30.6126935Z D=5120, 2025-05-07T20:33:30.6127125Z scale_ub=None, 2025-05-07T20:33:30.6127332Z contiguous=False, 2025-05-07T20:33:30.6127549Z compiled=False, 2025-05-07T20:33:30.6127745Z ) 2025-05-07T20:33:30.6128068Z self = 2025-05-07T20:33:30.6128565Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:30.6128849Z 2025-05-07T20:33:30.6128995Z @given( 2025-05-07T20:33:30.6129223Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:30.6129533Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:30.6129842Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:30.6130171Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:30.6130504Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:30.6130788Z ) 2025-05-07T20:33:30.6131133Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:30.6131585Z def test_silu_mul_quant( 2025-05-07T20:33:30.6131820Z self, 2025-05-07T20:33:30.6132010Z T: int, 2025-05-07T20:33:30.6132207Z D: int, 2025-05-07T20:33:30.6132423Z scale_ub: Optional[float], 2025-05-07T20:33:30.6132699Z contiguous: bool, 2025-05-07T20:33:30.6132931Z compiled: bool, 2025-05-07T20:33:30.6133147Z ) -> None: 2025-05-07T20:33:30.6133355Z torch.manual_seed(2025) 2025-05-07T20:33:30.6133592Z 2025-05-07T20:33:30.6133856Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:30.6136132Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:30.6138117Z 2025-05-07T20:33:30.6138236Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:30.6138453Z 2025-05-07T20:33:30.6138552Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:30.6138969Z self=, 2025-05-07T20:33:30.6139382Z T=4096, 2025-05-07T20:33:30.6139585Z D=7168, 2025-05-07T20:33:30.6139769Z scale_ub=None, 2025-05-07T20:33:30.6139987Z contiguous=True, 2025-05-07T20:33:30.6140212Z compiled=True, 2025-05-07T20:33:30.6140411Z ) 2025-05-07T20:33:30.6140730Z self = 2025-05-07T20:33:30.6141235Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:30.6141580Z 2025-05-07T20:33:30.6141658Z @given( 2025-05-07T20:33:30.6141888Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:30.6142205Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:30.6142512Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:30.6142916Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:30.6143254Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:30.6143541Z ) 2025-05-07T20:33:30.6143892Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:30.6144410Z def test_silu_mul_quant( 2025-05-07T20:33:30.6144659Z self, 2025-05-07T20:33:30.6144856Z T: int, 2025-05-07T20:33:30.6145055Z D: int, 2025-05-07T20:33:30.6145275Z scale_ub: Optional[float], 2025-05-07T20:33:30.6145550Z contiguous: bool, 2025-05-07T20:33:30.6145792Z compiled: bool, 2025-05-07T20:33:30.6146008Z ) -> None: 2025-05-07T20:33:30.6146219Z torch.manual_seed(2025) 2025-05-07T20:33:30.6146461Z 2025-05-07T20:33:30.6146729Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:30.6148956Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:30.6150949Z 2025-05-07T20:33:30.6151068Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:30.6151283Z 2025-05-07T20:33:30.6151387Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:30.6151807Z self=, 2025-05-07T20:33:30.6152226Z T=2048, 2025-05-07T20:33:30.6152413Z D=5120, 2025-05-07T20:33:30.6152603Z scale_ub=1200.0, 2025-05-07T20:33:30.6152816Z contiguous=False, 2025-05-07T20:33:30.6153035Z compiled=False, 2025-05-07T20:33:30.6153237Z ) 2025-05-07T20:33:30.6153557Z self = 2025-05-07T20:33:30.6154067Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:30.6154348Z 2025-05-07T20:33:30.6154426Z @given( 2025-05-07T20:33:30.6154655Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:30.6154976Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:30.6155288Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:30.6155628Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:30.6155964Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:30.6156248Z ) 2025-05-07T20:33:30.6156597Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:30.6157050Z def test_silu_mul_quant( 2025-05-07T20:33:30.6157290Z self, 2025-05-07T20:33:30.6157487Z T: int, 2025-05-07T20:33:30.6157696Z D: int, 2025-05-07T20:33:30.6157916Z scale_ub: Optional[float], 2025-05-07T20:33:30.6158187Z contiguous: bool, 2025-05-07T20:33:30.6158431Z compiled: bool, 2025-05-07T20:33:30.6158651Z ) -> None: 2025-05-07T20:33:30.6158856Z torch.manual_seed(2025) 2025-05-07T20:33:30.6159097Z 2025-05-07T20:33:30.6159368Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:30.6161536Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:30.6164137Z 2025-05-07T20:33:30.6164257Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:30.6164479Z 2025-05-07T20:33:30.6164582Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:30.6165007Z self=, 2025-05-07T20:33:30.6165517Z T=4096, 2025-05-07T20:33:30.6165701Z D=7168, 2025-05-07T20:33:30.6165894Z scale_ub=1200.0, 2025-05-07T20:33:30.6166113Z contiguous=True, 2025-05-07T20:33:30.6166338Z compiled=False, 2025-05-07T20:33:30.6166546Z ) 2025-05-07T20:33:30.7249149Z self = 2025-05-07T20:33:30.7249898Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:30.7250312Z 2025-05-07T20:33:30.7250420Z @given( 2025-05-07T20:33:30.7250772Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:30.7251108Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:30.7251417Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:30.7251755Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:30.7252194Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:30.7252484Z ) 2025-05-07T20:33:30.7252830Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:30.7253282Z def test_silu_mul_quant( 2025-05-07T20:33:30.7253526Z self, 2025-05-07T20:33:30.7253713Z T: int, 2025-05-07T20:33:30.7253906Z D: int, 2025-05-07T20:33:30.7254123Z scale_ub: Optional[float], 2025-05-07T20:33:30.7254392Z contiguous: bool, 2025-05-07T20:33:30.7254731Z compiled: bool, 2025-05-07T20:33:30.7254951Z ) -> None: 2025-05-07T20:33:30.7255159Z torch.manual_seed(2025) 2025-05-07T20:33:30.7255403Z 2025-05-07T20:33:30.7255675Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:30.7257867Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=True)
> x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=False)
> x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=False)
> x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
> x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:30.7317833Z 2025-05-07T20:33:30.7317960Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:30.7318181Z 2025-05-07T20:33:30.7318288Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:30.7318721Z self=, 2025-05-07T20:33:30.7319152Z T=128, 2025-05-07T20:33:30.7319344Z D=5120, 2025-05-07T20:33:30.7319537Z scale_ub=1200.0, 2025-05-07T20:33:30.7319763Z contiguous=False, 2025-05-07T20:33:30.7320026Z compiled=False, 2025-05-07T20:33:30.7320232Z ) 2025-05-07T20:33:30.8618652Z self = 2025-05-07T20:33:30.8619414Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:30.8619828Z 2025-05-07T20:33:30.8619930Z @given( 2025-05-07T20:33:30.8620161Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:30.8620480Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:30.8620794Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:30.8621135Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:30.8621471Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:30.8621765Z ) 2025-05-07T20:33:30.8622118Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:30.8622580Z def test_silu_mul_quant( 2025-05-07T20:33:30.8622827Z self, 2025-05-07T20:33:30.8623028Z T: int, 2025-05-07T20:33:30.8623238Z D: int, 2025-05-07T20:33:30.8623469Z scale_ub: Optional[float], 2025-05-07T20:33:30.8623784Z contiguous: bool, 2025-05-07T20:33:30.8624034Z compiled: bool, 2025-05-07T20:33:30.8624282Z ) -> None: 2025-05-07T20:33:30.8624534Z torch.manual_seed(2025) 2025-05-07T20:33:30.8624777Z 2025-05-07T20:33:30.8625052Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:30.8625582Z 2025-05-07T20:33:30.8625781Z x_sign = torch.sign(x) 2025-05-07T20:33:30.8626076Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:30.8626395Z x = x_sign * x_clamp 2025-05-07T20:33:30.8626631Z x0 = x[:, :D] 2025-05-07T20:33:30.8626848Z x1 = x[:, D:] 2025-05-07T20:33:30.8627059Z 2025-05-07T20:33:30.8627243Z if contiguous: 2025-05-07T20:33:30.8627483Z x0 = x0.contiguous() 2025-05-07T20:33:30.8627748Z x1 = x1.contiguous() 2025-05-07T20:33:30.8627993Z 2025-05-07T20:33:30.8628179Z if scale_ub is not None: 2025-05-07T20:33:30.8628466Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:30.8628804Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:30.8629116Z ) 2025-05-07T20:33:30.8629309Z else: 2025-05-07T20:33:30.8629521Z scale_ub_tensor = None 2025-05-07T20:33:30.8629773Z 2025-05-07T20:33:30.8630006Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:30.8630444Z op = silu_mul_quant 2025-05-07T20:33:30.8630696Z if compiled: 2025-05-07T20:33:30.8630942Z op = torch.compile(op) 2025-05-07T20:33:30.8631246Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:30.8631526Z 2025-05-07T20:33:30.8631779Z > y_fp8, y_scale = fn() 2025-05-07T20:33:30.8631946Z 2025-05-07T20:33:30.8632051Z moe/activation_test.py:117: 2025-05-07T20:33:30.8632356Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:30.8632696Z moe/activation_test.py:115: in fn 2025-05-07T20:33:30.8633042Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:30.8633767Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:30.8634489Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:30.8635050Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:30.8635775Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:30.8636465Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:30.8637022Z kernel = self.compile( 2025-05-07T20:33:30.8637590Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:30.8638339Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:30.8638747Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:30.8638989Z 2025-05-07T20:33:30.8639203Z self = 2025-05-07T20:33:30.8640330Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:30.8641758Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0527f147c0>} 2025-05-07T20:33:30.8643166Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:30.8644246Z context = 2025-05-07T20:33:30.8644547Z 2025-05-07T20:33:30.8644716Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:30.8645255Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:30.8645730Z module_map=module_map) 2025-05-07T20:33:30.8646086Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:30.8646448Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:30.8646711Z E ^ 2025-05-07T20:33:30.8647182Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:30.8647659Z 2025-05-07T20:33:30.8648098Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:30.8648644Z 2025-05-07T20:33:30.8648746Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:30.8649167Z self=, 2025-05-07T20:33:30.8649574Z T=2048, 2025-05-07T20:33:30.8649759Z D=7168, 2025-05-07T20:33:30.8649954Z scale_ub=None, 2025-05-07T20:33:30.8650162Z contiguous=False, 2025-05-07T20:33:30.8650388Z compiled=False, 2025-05-07T20:33:30.8650593Z ) 2025-05-07T20:33:30.8650907Z self = 2025-05-07T20:33:30.8651466Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:30.8651752Z 2025-05-07T20:33:30.8651842Z @given( 2025-05-07T20:33:30.8652070Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:30.8652381Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:30.8652729Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:30.8653065Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:30.8653389Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:30.8653713Z ) 2025-05-07T20:33:30.8654067Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:30.8654594Z def test_silu_mul_quant( 2025-05-07T20:33:30.8654836Z self, 2025-05-07T20:33:30.8655031Z T: int, 2025-05-07T20:33:30.8655220Z D: int, 2025-05-07T20:33:30.8655430Z scale_ub: Optional[float], 2025-05-07T20:33:30.8655701Z contiguous: bool, 2025-05-07T20:33:30.8655935Z compiled: bool, 2025-05-07T20:33:30.8656155Z ) -> None: 2025-05-07T20:33:30.8656360Z torch.manual_seed(2025) 2025-05-07T20:33:30.8656594Z 2025-05-07T20:33:30.8656865Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:30.8659102Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
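The CompilationError above is an architecture limit rather than a flake: Triton's fp8e4nv type corresponds to FP8 E4M3, which its NVIDIA backend only generates for compute capability 8.9 and newer, while the A10G in a g5.4xlarge reports (8, 6). A hedged sketch of a skip guard (the helper and class names are illustrative, not from activation_test.py):

import unittest
import torch

def _supports_fp8e4nv() -> bool:
    # fp8e4nv (FP8 E4M3) codegen needs SM 8.9+ (Ada/Hopper);
    # the A10G on this runner reports (8, 6).
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

@unittest.skipUnless(_supports_fp8e4nv(), "fp8e4nv requires SM 8.9 or newer")
class Fp8ActivationTests(unittest.TestCase):  # hypothetical test class
    ...

Gating the fp8 cases this way would turn the CompilationError sub-failures below into explicit skips on this runner type.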
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:30.8661103Z 2025-05-07T20:33:30.8661222Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:30.8661439Z 2025-05-07T20:33:30.8661550Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:30.8661973Z self=, 2025-05-07T20:33:30.8662395Z T=128, 2025-05-07T20:33:30.8662588Z D=7168, 2025-05-07T20:33:30.8662783Z scale_ub=1200.0, 2025-05-07T20:33:30.8663007Z contiguous=True, 2025-05-07T20:33:30.8663234Z compiled=True, 2025-05-07T20:33:30.8663441Z ) 2025-05-07T20:33:30.8974642Z self = 2025-05-07T20:33:30.8976228Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:30.8977011Z 2025-05-07T20:33:30.8977225Z @given( 2025-05-07T20:33:30.8977825Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:30.8978455Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:30.8979059Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:30.8979728Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:30.8980397Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:30.8980965Z ) 2025-05-07T20:33:30.8981659Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:30.8982559Z def test_silu_mul_quant( 2025-05-07T20:33:30.8983047Z self, 2025-05-07T20:33:30.8983429Z T: int, 2025-05-07T20:33:30.8983823Z D: int, 2025-05-07T20:33:30.8984214Z scale_ub: Optional[float], 2025-05-07T20:33:30.8984539Z contiguous: bool, 2025-05-07T20:33:30.8984800Z compiled: bool, 2025-05-07T20:33:30.8985024Z ) -> None: 2025-05-07T20:33:30.8985235Z torch.manual_seed(2025) 2025-05-07T20:33:30.8985481Z 2025-05-07T20:33:30.8985749Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:30.8986099Z 2025-05-07T20:33:30.8986288Z x_sign = torch.sign(x) 2025-05-07T20:33:30.8986576Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:30.8986995Z x = x_sign * x_clamp 2025-05-07T20:33:30.8987234Z x0 = x[:, :D] 2025-05-07T20:33:30.8987446Z x1 = x[:, D:] 2025-05-07T20:33:30.8987653Z 2025-05-07T20:33:30.8987839Z if contiguous: 2025-05-07T20:33:30.8988069Z x0 = x0.contiguous() 2025-05-07T20:33:30.8988416Z x1 = x1.contiguous() 2025-05-07T20:33:30.8988666Z 2025-05-07T20:33:30.8988857Z if scale_ub is not None: 2025-05-07T20:33:30.8989131Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:30.8989468Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:30.8989863Z ) 2025-05-07T20:33:30.8990057Z else: 2025-05-07T20:33:30.8990272Z scale_ub_tensor = None 2025-05-07T20:33:30.8990525Z 2025-05-07T20:33:30.8990750Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:30.8991072Z op = silu_mul_quant 2025-05-07T20:33:30.8991324Z if compiled: 2025-05-07T20:33:30.8991567Z op = torch.compile(op) 2025-05-07T20:33:30.8991868Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:30.8992146Z 2025-05-07T20:33:30.8992332Z > y_fp8, y_scale = fn() 2025-05-07T20:33:30.8992500Z 2025-05-07T20:33:30.8992600Z moe/activation_test.py:117: 2025-05-07T20:33:30.8992905Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:30.8993247Z moe/activation_test.py:115: in fn 2025-05-07T20:33:30.8993592Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:30.8994176Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:30.8994761Z return fn(*args, **kwargs) 
2025-05-07T20:33:30.8995440Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:30.8996166Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:30.8996727Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:30.8997444Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:30.8998132Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:30.8998683Z kernel = self.compile( 2025-05-07T20:33:30.8999246Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:30.8999928Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:30.9000336Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:30.9000579Z 2025-05-07T20:33:30.9000786Z self = 2025-05-07T20:33:30.9001907Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:30.9003340Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0527f15940>} 2025-05-07T20:33:30.9004746Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:30.9005828Z context = 2025-05-07T20:33:30.9006132Z 2025-05-07T20:33:30.9006302Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:30.9006838Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:30.9007313Z module_map=module_map) 2025-05-07T20:33:30.9007735Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:30.9008101Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:30.9008367Z E ^ 2025-05-07T20:33:30.9008891Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:30.9009370Z 2025-05-07T20:33:30.9009808Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:30.9010347Z 2025-05-07T20:33:30.9010457Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:30.9010913Z self=, 2025-05-07T20:33:30.9011330Z T=128, 2025-05-07T20:33:30.9011516Z D=7168, 2025-05-07T20:33:30.9011710Z scale_ub=1200.0, 2025-05-07T20:33:30.9011931Z contiguous=True, 2025-05-07T20:33:30.9012146Z compiled=False, 2025-05-07T20:33:30.9012346Z ) 2025-05-07T20:33:30.9012665Z self = 2025-05-07T20:33:30.9013170Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:30.9013448Z 2025-05-07T20:33:30.9013529Z @given( 2025-05-07T20:33:30.9013748Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:30.9014066Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:30.9014377Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:30.9014860Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:30.9015193Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:30.9015492Z ) 2025-05-07T20:33:30.9015839Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:30.9016290Z def test_silu_mul_quant( 2025-05-07T20:33:30.9016531Z self, 2025-05-07T20:33:30.9016726Z T: int, 2025-05-07T20:33:30.9016920Z D: int, 2025-05-07T20:33:30.9017133Z scale_ub: Optional[float], 2025-05-07T20:33:30.9017402Z contiguous: bool, 2025-05-07T20:33:30.9017635Z compiled: bool, 2025-05-07T20:33:30.9017849Z ) -> None: 2025-05-07T20:33:30.9018060Z torch.manual_seed(2025) 2025-05-07T20:33:30.9018297Z 2025-05-07T20:33:30.9018572Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:30.9018920Z 2025-05-07T20:33:30.9019106Z x_sign = torch.sign(x) 2025-05-07T20:33:30.9019400Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:30.9021532Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
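Every OOM message in this log suggests PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. The setting must be in the environment before CUDA is initialized; one way to apply it at a test entry point, as a sketch only:

import os

# Must be set before torch initializes CUDA for the option to take effect.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch  # noqa: E402  (imported deliberately after the env var is set)

That said, the statistics above show 21.7+ GiB already allocated by PyTorch with only a few MiB free, so fragmentation is unlikely to be the root cause here; memory is simply not being released between examples.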
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
> x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=True)
> x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
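Note how the PyTorch-allocated figure creeps from 21.73 GiB to 21.77 GiB across trials, which points at tensors surviving from one Hypothesis example to the next. A sketch of an explicit cleanup that could run between examples (hypothetical helper, not present in activation_test.py):

import gc
import torch

def _release_cuda_memory() -> None:
    gc.collect()              # drop dead Python references to tensors first
    torch.cuda.empty_cache()  # return cached, unused blocks to the driver
    torch.cuda.synchronize()  # make sure pending frees have completed

Called from the test class's setUp or tearDown, this would keep one trial's [16384, 14336] bf16 allocations from starving the next trial on the 22 GiB device.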
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:31.1715323Z 2025-05-07T20:33:31.1715444Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:31.1715670Z 2025-05-07T20:33:31.1733150Z FAILED 2025-05-07T20:33:31.1733347Z 2025-05-07T20:33:31.1733713Z =================================== FAILURES =================================== 2025-05-07T20:33:31.1734364Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:33:31.1735126Z + Exception Group Traceback (most recent call last): 2025-05-07T20:33:31.1735983Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 58, in testPartExecutor 2025-05-07T20:33:31.1736775Z | yield 2025-05-07T20:33:31.1737383Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 634, in run 2025-05-07T20:33:31.1738110Z | self._callTestMethod(testMethod) 2025-05-07T20:33:31.1738907Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 589, in _callTestMethod 2025-05-07T20:33:31.1739721Z | if method() is not None: 2025-05-07T20:33:31.1740066Z | ^^^^^^^^ 2025-05-07T20:33:31.1741095Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:33:31.1742159Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.1753064Z | ^^^^^^^ 2025-05-07T20:33:31.1753841Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:33:31.1754803Z | raise the_error_hypothesis_found 2025-05-07T20:33:31.1755420Z | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:33:31.1756032Z +-+---------------- 1 ---------------- 2025-05-07T20:33:31.1756453Z | Traceback (most recent call last): 2025-05-07T20:33:31.1757502Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:31.1758632Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.1759426Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:31.1762287Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:31.1765164Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:31.1765799Z | self=, 2025-05-07T20:33:31.1766414Z | T=2048, 2025-05-07T20:33:31.1766743Z | D=5120, # or any other generated value 2025-05-07T20:33:31.1767242Z | scale_ub=None, # or any other generated value 2025-05-07T20:33:31.1767807Z | contiguous=True, # or any other generated value 2025-05-07T20:33:31.1768322Z | compiled=False, # or any other generated value 2025-05-07T20:33:31.1768746Z | ) 2025-05-07T20:33:31.1769008Z | 2025-05-07T20:33:31.1769769Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:33:31.1770774Z +---------------- 2 ---------------- 2025-05-07T20:33:31.1771178Z | Traceback (most recent call last): 2025-05-07T20:33:31.1772292Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:31.1773458Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.1773987Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:31.1777057Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:31.1779172Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:31.1779628Z | self=, 2025-05-07T20:33:31.1780053Z | T=128, 2025-05-07T20:33:31.1780253Z | D=7168, 2025-05-07T20:33:31.1780466Z | scale_ub=None, 2025-05-07T20:33:31.1780761Z | contiguous=True, 2025-05-07T20:33:31.1781003Z | compiled=True, 2025-05-07T20:33:31.1781234Z | ) 2025-05-07T20:33:31.1781416Z | 2025-05-07T20:33:31.1781950Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:33:31.1782579Z +---------------- 3 ---------------- 2025-05-07T20:33:31.1782877Z | Traceback (most recent call last): 2025-05-07T20:33:31.1783618Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:31.1784424Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.1784812Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:31.1786923Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
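Each falsifying example above comes with a replay blob, and Hypothesis's reproduce_failure decorator re-runs exactly that draw. A sketch for the first failure, using the version string and blob printed in this log; the test's existing @given and @settings decorators stay in place underneath:

from hypothesis import reproduce_failure

@reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=')  # blob from sub-exception 1 above
# ... the test's existing @given(...) and @settings(...) decorators go here,
# unchanged ...
def test_silu_mul_quant(self, T, D, scale_ub, contiguous, compiled):
    ...

The version in the first argument must match the installed Hypothesis, which the log's own suggestion ('6.131.14') already reflects.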
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:31.1789020Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:31.1789472Z | self=, 2025-05-07T20:33:31.1789893Z | T=128, 2025-05-07T20:33:31.1790100Z | D=5120, 2025-05-07T20:33:31.1790317Z | scale_ub=1200.0, 2025-05-07T20:33:31.1790557Z | contiguous=True, 2025-05-07T20:33:31.1790805Z | compiled=True, 2025-05-07T20:33:31.1791035Z | ) 2025-05-07T20:33:31.1791212Z | 2025-05-07T20:33:31.1791757Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:33:31.1792392Z +---------------- 4 ---------------- 2025-05-07T20:33:31.1792725Z | Traceback (most recent call last): 2025-05-07T20:33:31.1793722Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:33:31.1794838Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:31.1795325Z | ^^^^^^^^ 2025-05-07T20:33:31.1796247Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:33:31.1797351Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:31.1797868Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:31.1799081Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:33:31.1800138Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:31.1800792Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:33:31.1801558Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:31.1802016Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:31.1802696Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:33:31.1803517Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:31.1804166Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:31.1805145Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:33:31.1806178Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:31.1806732Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:31.1807603Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:33:31.1808418Z | fn() 2025-05-07T20:33:31.1809245Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:33:31.1810171Z | self.fn.run( 2025-05-07T20:33:31.1810927Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:33:31.1811781Z | kernel = self.compile( 2025-05-07T20:33:31.1812169Z | ^^^^^^^^^^^^^ 2025-05-07T20:33:31.1813048Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:33:31.1814073Z | 
module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:31.1814776Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:31.1815720Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:33:31.1816868Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:31.1817582Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:31.1818135Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:31.1818648Z | def _kernel_quantize_fp8_row( 2025-05-07T20:33:31.1819030Z | ^ 2025-05-07T20:33:31.1819696Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:31.1820531Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:31.1821102Z | # The test always failed when commented parts were varied together. 2025-05-07T20:33:31.1821828Z | self=, 2025-05-07T20:33:31.1822462Z | T=1, # or any other generated value 2025-05-07T20:33:31.1822986Z | D=5120, # or any other generated value 2025-05-07T20:33:31.1823465Z | scale_ub=None, # or any other generated value 2025-05-07T20:33:31.1823986Z | contiguous=True, # or any other generated value 2025-05-07T20:33:31.1824563Z | compiled=True, # or any other generated value 2025-05-07T20:33:31.1825053Z | ) 2025-05-07T20:33:31.1825326Z | 2025-05-07T20:33:31.1826396Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:33:31.1827440Z +------------------------------------ 2025-05-07T20:33:31.1827942Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:33:31.1828479Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.1829061Z self=, 2025-05-07T20:33:31.1829645Z T=1, 2025-05-07T20:33:31.1829923Z D=5120, 2025-05-07T20:33:31.1830212Z scale_ub=None, 2025-05-07T20:33:31.1830511Z contiguous=True, 2025-05-07T20:33:31.1830826Z compiled=True, 2025-05-07T20:33:31.1832045Z ) 2025-05-07T20:33:31.1832501Z self = 2025-05-07T20:33:31.1833200Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:31.1833577Z 2025-05-07T20:33:31.1833697Z @given( 2025-05-07T20:33:31.1834113Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.1834541Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.1834981Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.1835452Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.1835917Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.1836347Z ) 2025-05-07T20:33:31.1836853Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.1837507Z def test_silu_mul_quant( 2025-05-07T20:33:31.1837854Z self, 2025-05-07T20:33:31.1838134Z T: int, 2025-05-07T20:33:31.1838403Z D: int, 2025-05-07T20:33:31.1838717Z scale_ub: Optional[float], 2025-05-07T20:33:31.1839093Z contiguous: bool, 2025-05-07T20:33:31.1839426Z compiled: bool, 2025-05-07T20:33:31.1839758Z ) -> None: 2025-05-07T20:33:31.1840054Z torch.manual_seed(2025) 2025-05-07T20:33:31.1840378Z 2025-05-07T20:33:31.1840755Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.1841231Z 2025-05-07T20:33:31.1841489Z x_sign = torch.sign(x) 2025-05-07T20:33:31.1841866Z x_clamp = 
torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.1842271Z x = x_sign * x_clamp 2025-05-07T20:33:31.1842584Z x0 = x[:, :D] 2025-05-07T20:33:31.1842871Z x1 = x[:, D:] 2025-05-07T20:33:31.1843161Z 2025-05-07T20:33:31.1843416Z if contiguous: 2025-05-07T20:33:31.1843733Z x0 = x0.contiguous() 2025-05-07T20:33:31.1844093Z x1 = x1.contiguous() 2025-05-07T20:33:31.1844433Z 2025-05-07T20:33:31.1844703Z if scale_ub is not None: 2025-05-07T20:33:31.1845086Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:31.1845540Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:31.1845954Z ) 2025-05-07T20:33:31.1846211Z else: 2025-05-07T20:33:31.1846490Z scale_ub_tensor = None 2025-05-07T20:33:31.1846823Z 2025-05-07T20:33:31.1847132Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.1847542Z op = silu_mul_quant 2025-05-07T20:33:31.1847899Z if compiled: 2025-05-07T20:33:31.1848238Z op = torch.compile(op) 2025-05-07T20:33:31.1848663Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.1849063Z 2025-05-07T20:33:31.1849324Z y_fp8, y_scale = fn() 2025-05-07T20:33:31.1849728Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:31.1850266Z 2025-05-07T20:33:31.1850590Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.1851069Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:31.1851488Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:31.1851981Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:31.1852467Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:31.1852879Z 2025-05-07T20:33:31.1853145Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:31.1853534Z 2025-05-07T20:33:31.1853667Z moe/activation_test.py:126: 2025-05-07T20:33:31.1854059Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.1854636Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:31.1855140Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:31.1856232Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:31.1857264Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:31.1857994Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:31.1858914Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:31.1859894Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:31.1860892Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:31.1861894Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:31.1862834Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:31.1863681Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:31.1864416Z fn() 2025-05-07T20:33:31.1865114Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:31.1865938Z self.fn.run( 2025-05-07T20:33:31.1866579Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:31.1867311Z kernel = self.compile( 2025-05-07T20:33:31.1868053Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:31.1868959Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:31.1869504Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.1869820Z 2025-05-07T20:33:31.1870096Z self = 2025-05-07T20:33:31.1871573Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:31.1873520Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f09a057dc60>} 2025-05-07T20:33:31.1875508Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:31.1877034Z context = 2025-05-07T20:33:31.1877455Z 2025-05-07T20:33:31.1877696Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:31.1878424Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:31.1879163Z module_map=module_map) 2025-05-07T20:33:31.1879664Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:31.1880193Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:31.1880560Z E ^ 2025-05-07T20:33:31.1881252Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:31.1881870Z 2025-05-07T20:33:31.1882431Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:31.1883174Z 2025-05-07T20:33:31.1883312Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.1883852Z self=, 2025-05-07T20:33:31.1884372Z T=2048, 2025-05-07T20:33:31.1884618Z D=5120, 2025-05-07T20:33:31.1884881Z scale_ub=1200.0, 2025-05-07T20:33:31.1885211Z contiguous=True, 2025-05-07T20:33:31.1885516Z compiled=False, 2025-05-07T20:33:31.1885788Z ) 2025-05-07T20:33:31.1886212Z self = 2025-05-07T20:33:31.1886865Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:31.1887239Z 2025-05-07T20:33:31.1887346Z @given( 2025-05-07T20:33:31.1887672Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.1888078Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.1888536Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.1888994Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.1889448Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.1889839Z ) 2025-05-07T20:33:31.1890338Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.1890997Z def test_silu_mul_quant( 2025-05-07T20:33:31.1891324Z self, 2025-05-07T20:33:31.1891583Z T: int, 2025-05-07T20:33:31.1891850Z D: int, 2025-05-07T20:33:31.1892139Z scale_ub: Optional[float], 2025-05-07T20:33:31.1892509Z contiguous: bool, 2025-05-07T20:33:31.1892831Z compiled: bool, 2025-05-07T20:33:31.1893123Z ) -> None: 2025-05-07T20:33:31.1893408Z torch.manual_seed(2025) 2025-05-07T20:33:31.1893736Z 2025-05-07T20:33:31.1894089Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.1894665Z 2025-05-07T20:33:31.1894931Z x_sign = torch.sign(x) 2025-05-07T20:33:31.1895314Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.1895729Z x = x_sign * x_clamp 2025-05-07T20:33:31.1896047Z x0 = x[:, :D] 
2025-05-07T20:33:31.1896332Z x1 = x[:, D:] 2025-05-07T20:33:31.1896615Z 2025-05-07T20:33:31.1896871Z if contiguous: 2025-05-07T20:33:31.1897188Z x0 = x0.contiguous() 2025-05-07T20:33:31.1897562Z x1 = x1.contiguous() 2025-05-07T20:33:31.1897912Z 2025-05-07T20:33:31.1898195Z if scale_ub is not None: 2025-05-07T20:33:31.1898566Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:31.1899000Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:31.1899428Z ) 2025-05-07T20:33:31.1899678Z else: 2025-05-07T20:33:31.1899957Z scale_ub_tensor = None 2025-05-07T20:33:31.1900317Z 2025-05-07T20:33:31.1900625Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.1901068Z op = silu_mul_quant 2025-05-07T20:33:31.1901421Z if compiled: 2025-05-07T20:33:31.1901767Z op = torch.compile(op) 2025-05-07T20:33:31.1902183Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.1902582Z 2025-05-07T20:33:31.1902831Z > y_fp8, y_scale = fn() 2025-05-07T20:33:31.1903058Z 2025-05-07T20:33:31.1903189Z moe/activation_test.py:117: 2025-05-07T20:33:31.1903587Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.1904116Z moe/activation_test.py:115: in fn 2025-05-07T20:33:31.1904494Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.1905462Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:31.1906512Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:31.1907303Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:31.1908321Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:31.1909361Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:31.1910156Z kernel = self.compile( 2025-05-07T20:33:31.1910895Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:31.1911805Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:31.1912379Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.1912721Z 2025-05-07T20:33:31.1913025Z self = 2025-05-07T20:33:31.1914661Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:31.1916665Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f09a03d4220>} 2025-05-07T20:33:31.1918646Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:31.1920137Z context = 2025-05-07T20:33:31.1920548Z 2025-05-07T20:33:31.1920777Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:31.1921523Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:31.1922213Z module_map=module_map) 2025-05-07T20:33:31.1922726Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:31.1923216Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:31.1923575Z E ^ 2025-05-07T20:33:31.1924239Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:31.1924903Z 2025-05-07T20:33:31.1925720Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:31.1926494Z 2025-05-07T20:33:31.1926643Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.1927240Z self=, 2025-05-07T20:33:31.1927827Z T=2048, 2025-05-07T20:33:31.1928086Z D=5120, 2025-05-07T20:33:31.1928350Z scale_ub=1200.0, 2025-05-07T20:33:31.1928664Z contiguous=True, 2025-05-07T20:33:31.1928970Z compiled=True, 2025-05-07T20:33:31.1929257Z ) 2025-05-07T20:33:31.1929713Z self = 2025-05-07T20:33:31.1930414Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:31.1930804Z 2025-05-07T20:33:31.1930915Z @given( 2025-05-07T20:33:31.1931242Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.1931680Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.1932110Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.1932577Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.1933036Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.1933641Z ) 2025-05-07T20:33:31.1934131Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.1934880Z def test_silu_mul_quant( 2025-05-07T20:33:31.1935216Z self, 2025-05-07T20:33:31.1935480Z T: int, 2025-05-07T20:33:31.1935868Z D: int, 2025-05-07T20:33:31.1936174Z scale_ub: Optional[float], 2025-05-07T20:33:31.1936575Z contiguous: bool, 2025-05-07T20:33:31.1936926Z compiled: bool, 2025-05-07T20:33:31.1937248Z ) -> None: 2025-05-07T20:33:31.1937557Z torch.manual_seed(2025) 2025-05-07T20:33:31.1937989Z 2025-05-07T20:33:31.1938382Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.1938868Z 2025-05-07T20:33:31.1939140Z x_sign = torch.sign(x) 2025-05-07T20:33:31.1939547Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.1939976Z x = x_sign * x_clamp 2025-05-07T20:33:31.1940303Z x0 = x[:, :D] 2025-05-07T20:33:31.1940598Z x1 = x[:, D:] 2025-05-07T20:33:31.1940868Z 2025-05-07T20:33:31.1941114Z if contiguous: 2025-05-07T20:33:31.1941446Z x0 = x0.contiguous() 2025-05-07T20:33:31.1941777Z x1 = x1.contiguous() 2025-05-07T20:33:31.1942087Z 2025-05-07T20:33:31.1942355Z if scale_ub is not None: 2025-05-07T20:33:31.1942695Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:31.1943209Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:31.1943656Z ) 2025-05-07T20:33:31.1943928Z else: 2025-05-07T20:33:31.1944233Z scale_ub_tensor = None 2025-05-07T20:33:31.1944640Z 2025-05-07T20:33:31.1944953Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.1945392Z op = silu_mul_quant 2025-05-07T20:33:31.1945882Z if compiled: 2025-05-07T20:33:31.1946342Z op = torch.compile(op) 2025-05-07T20:33:31.1946820Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.1947623Z 2025-05-07T20:33:31.1947961Z y_fp8, y_scale = fn() 2025-05-07T20:33:31.1966969Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:31.1967276Z 2025-05-07T20:33:31.1967531Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.1967875Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:31.1968170Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:31.1968490Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:31.1968856Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:31.1969166Z 2025-05-07T20:33:31.1969365Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:31.1969565Z 2025-05-07T20:33:31.1969668Z moe/activation_test.py:126: 2025-05-07T20:33:31.1969964Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.1970306Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:31.1970638Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:31.1971459Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:31.1972247Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:31.1972812Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:31.1973543Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:31.1974276Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:31.1975150Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:31.1975933Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:31.1976710Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:31.1977347Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:31.1977903Z fn() 2025-05-07T20:33:31.1979716Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:31.1980342Z self.fn.run( 2025-05-07T20:33:31.1980841Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:31.1981464Z kernel = self.compile( 2025-05-07T20:33:31.1982042Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:31.1982738Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:31.1983155Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.1983412Z 2025-05-07T20:33:31.1983626Z self = 2025-05-07T20:33:31.1984813Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:31.1986302Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f09a03d56c0>} 2025-05-07T20:33:31.1987732Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:31.1988833Z context = 2025-05-07T20:33:31.1989139Z 2025-05-07T20:33:31.1989316Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:31.1989861Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:31.1990341Z module_map=module_map) 2025-05-07T20:33:31.1990718Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:31.1991088Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:31.1991363Z E ^ 2025-05-07T20:33:31.1991856Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:31.1992336Z 2025-05-07T20:33:31.1992777Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:31.1993322Z 2025-05-07T20:33:31.1993429Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.1993849Z self=, 2025-05-07T20:33:31.1994269Z T=16384, 2025-05-07T20:33:31.1994522Z D=7168, 2025-05-07T20:33:31.1994738Z scale_ub=1200.0, 2025-05-07T20:33:31.1994976Z contiguous=False, 2025-05-07T20:33:31.1995212Z compiled=False, 2025-05-07T20:33:31.1995427Z ) 2025-05-07T20:33:31.1995769Z self = 2025-05-07T20:33:31.1996312Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:31.1996609Z 2025-05-07T20:33:31.1996699Z @given( 2025-05-07T20:33:31.1996942Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.1997274Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.1997594Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.1997941Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.1998287Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.1998600Z ) 2025-05-07T20:33:31.1998967Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.1999496Z def test_silu_mul_quant( 2025-05-07T20:33:31.1999749Z self, 2025-05-07T20:33:31.1999951Z T: int, 2025-05-07T20:33:31.2000149Z D: int, 2025-05-07T20:33:31.2000375Z scale_ub: Optional[float], 2025-05-07T20:33:31.2000665Z contiguous: bool, 2025-05-07T20:33:31.2000962Z compiled: bool, 2025-05-07T20:33:31.2001194Z ) -> None: 2025-05-07T20:33:31.2001418Z torch.manual_seed(2025) 2025-05-07T20:33:31.2001657Z 2025-05-07T20:33:31.2001935Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2002327Z 2025-05-07T20:33:31.2002518Z x_sign = torch.sign(x) 2025-05-07T20:33:31.2002814Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.2003129Z x = x_sign * x_clamp 2025-05-07T20:33:31.2003371Z x0 = x[:, :D] 2025-05-07T20:33:31.2003597Z x1 = x[:, D:] 2025-05-07T20:33:31.2003800Z 2025-05-07T20:33:31.2003979Z if contiguous: 2025-05-07T20:33:31.2004212Z x0 = x0.contiguous() 2025-05-07T20:33:31.2004490Z x1 = x1.contiguous() 2025-05-07T20:33:31.2004744Z 2025-05-07T20:33:31.2004984Z if scale_ub is not None: 2025-05-07T20:33:31.2005297Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:31.2005647Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:31.2005987Z ) 2025-05-07T20:33:31.2006202Z else: 2025-05-07T20:33:31.2006498Z scale_ub_tensor = None 2025-05-07T20:33:31.2006776Z 2025-05-07T20:33:31.2007030Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.2007378Z op = silu_mul_quant 2025-05-07T20:33:31.2007644Z if compiled: 2025-05-07T20:33:31.2007915Z op = torch.compile(op) 2025-05-07T20:33:31.2008241Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2008534Z 2025-05-07T20:33:31.2008747Z > y_fp8, y_scale = fn() 2025-05-07T20:33:31.2008922Z 2025-05-07T20:33:31.2009043Z moe/activation_test.py:117: 2025-05-07T20:33:31.2009354Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2009720Z moe/activation_test.py:115: in fn 2025-05-07T20:33:31.2010024Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2010750Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
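Every example Hypothesis draws fails identically: Triton rejects the fp8e4nv (E4M3) element type while compiling the kernel, before it ever launches. In the Triton build used here, fp8e4nv is only lowered natively on NVIDIA GPUs of compute capability 8.9 or newer (Ada/Hopper); older architectures expose only fp8e4b15 and fp8e5, exactly as the ValueError lists. The failures in this log are therefore an environment/hardware mismatch rather than a logic bug in silu_mul_quant. A minimal sketch of a capability gate follows, assuming a plain pytest marker; the requires_fp8e4nv name is hypothetical and is not FBGEMM's actual test setup:

import pytest
import torch

def _supports_fp8e4nv() -> bool:
    # Hypothetical guard: Triton lowers fp8e4nv (E4M3) on NVIDIA SM 8.9+
    # (Ada/Hopper); older GPUs expose only fp8e4b15/fp8e5, per the error above.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

requires_fp8e4nv = pytest.mark.skipif(
    not _supports_fp8e4nv(),
    reason="fp8e4nv (E4M3) is not supported on this GPU architecture",
)

Applied as @requires_fp8e4nv on test_silu_mul_quant, a guard like this would collapse the repeated CompilationError blocks below into a single skip.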
2025-05-07T20:33:31.2011494Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:31.2012068Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:31.2012794Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:31.2013495Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:31.2014067Z kernel = self.compile( 2025-05-07T20:33:31.2014689Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:31.2015388Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:31.2015800Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2016057Z 2025-05-07T20:33:31.2016273Z self = 2025-05-07T20:33:31.2017411Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:31.2018846Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f099b2c0180>} 2025-05-07T20:33:31.2020263Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:31.2021427Z context = 2025-05-07T20:33:31.2021740Z 2025-05-07T20:33:31.2021953Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:31.2022505Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:31.2022997Z module_map=module_map) 2025-05-07T20:33:31.2023429Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:31.2023808Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:31.2024085Z E ^ 2025-05-07T20:33:31.2024583Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:31.2025114Z 2025-05-07T20:33:31.2025822Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:31.2026407Z 2025-05-07T20:33:31.2026526Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2026955Z self=, 2025-05-07T20:33:31.2027384Z T=1, 2025-05-07T20:33:31.2027595Z D=7168, 2025-05-07T20:33:31.2027796Z scale_ub=None, 2025-05-07T20:33:31.2028028Z contiguous=True, 2025-05-07T20:33:31.2028267Z compiled=True, 2025-05-07T20:33:31.2028589Z ) 2025-05-07T20:33:31.2028918Z self = 2025-05-07T20:33:31.2029428Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:31.2029702Z 2025-05-07T20:33:31.2029790Z @given( 2025-05-07T20:33:31.2030026Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2030356Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2030681Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2031018Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2031358Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2031662Z ) 2025-05-07T20:33:31.2032030Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2032492Z def test_silu_mul_quant( 2025-05-07T20:33:31.2032746Z self, 2025-05-07T20:33:31.2032963Z T: int, 2025-05-07T20:33:31.2033172Z D: int, 2025-05-07T20:33:31.2033414Z scale_ub: Optional[float], 2025-05-07T20:33:31.2033713Z contiguous: bool, 2025-05-07T20:33:31.2033963Z compiled: bool, 2025-05-07T20:33:31.2034211Z ) -> None: 2025-05-07T20:33:31.2034483Z torch.manual_seed(2025) 2025-05-07T20:33:31.2034739Z 2025-05-07T20:33:31.2035035Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2035408Z 2025-05-07T20:33:31.2035623Z x_sign = torch.sign(x) 2025-05-07T20:33:31.2035940Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.2036273Z x = x_sign * x_clamp 2025-05-07T20:33:31.2036524Z x0 = x[:, :D] 2025-05-07T20:33:31.2036762Z x1 = x[:, D:] 2025-05-07T20:33:31.2036991Z 2025-05-07T20:33:31.2037191Z if contiguous: 2025-05-07T20:33:31.2037448Z x0 = x0.contiguous() 2025-05-07T20:33:31.2037729Z x1 = x1.contiguous() 2025-05-07T20:33:31.2037991Z 2025-05-07T20:33:31.2038196Z if scale_ub is not None: 2025-05-07T20:33:31.2038501Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:31.2038861Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:31.2039179Z ) 2025-05-07T20:33:31.2039385Z else: 2025-05-07T20:33:31.2039596Z scale_ub_tensor = None 2025-05-07T20:33:31.2039848Z 2025-05-07T20:33:31.2040089Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.2040515Z op = silu_mul_quant 2025-05-07T20:33:31.2040790Z if compiled: 2025-05-07T20:33:31.2041066Z op = torch.compile(op) 2025-05-07T20:33:31.2041389Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2041682Z 2025-05-07T20:33:31.2041901Z y_fp8, y_scale = fn() 2025-05-07T20:33:31.2042281Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:31.2042591Z 2025-05-07T20:33:31.2042865Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.2043236Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:31.2043624Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:31.2043964Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:31.2044362Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:31.2044707Z 2025-05-07T20:33:31.2044923Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:31.2045143Z 2025-05-07T20:33:31.2045259Z moe/activation_test.py:126: 2025-05-07T20:33:31.2045594Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2045969Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:31.2046331Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:31.2047166Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:31.2048033Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:31.2048623Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:31.2049363Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:31.2050093Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:31.2050870Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:31.2051656Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:31.2052341Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:31.2052980Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:31.2053540Z fn() 2025-05-07T20:33:31.2054077Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:31.2054776Z self.fn.run( 2025-05-07T20:33:31.2055294Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:31.2055861Z kernel = self.compile( 2025-05-07T20:33:31.2056428Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:31.2057116Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:31.2057535Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2057777Z 2025-05-07T20:33:31.2057998Z self = 2025-05-07T20:33:31.2059143Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:31.2060576Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f099b2c0cc0>} 2025-05-07T20:33:31.2061996Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:31.2063144Z context = 2025-05-07T20:33:31.2063449Z 2025-05-07T20:33:31.2063632Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:31.2064214Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:31.2064709Z module_map=module_map) 2025-05-07T20:33:31.2065097Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:31.2065481Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:31.2065801Z E ^ 2025-05-07T20:33:31.2066293Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:31.2066771Z 2025-05-07T20:33:31.2067222Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:31.2067771Z 2025-05-07T20:33:31.2067880Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2068325Z self=, 2025-05-07T20:33:31.2068755Z T=4096, 2025-05-07T20:33:31.2068958Z D=5120, 2025-05-07T20:33:31.2069149Z scale_ub=None, 2025-05-07T20:33:31.2069374Z contiguous=False, 2025-05-07T20:33:31.2069611Z compiled=False, 2025-05-07T20:33:31.2069817Z ) 2025-05-07T20:33:31.2070151Z self = 2025-05-07T20:33:31.2070721Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:31.2071012Z 2025-05-07T20:33:31.2071093Z @given( 2025-05-07T20:33:31.2071333Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2071653Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2071962Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2072308Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2072652Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2072953Z ) 2025-05-07T20:33:31.2073309Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2073775Z def test_silu_mul_quant( 2025-05-07T20:33:31.2074030Z self, 2025-05-07T20:33:31.2074227Z T: int, 2025-05-07T20:33:31.2074437Z D: int, 2025-05-07T20:33:31.2074662Z scale_ub: Optional[float], 2025-05-07T20:33:31.2074943Z contiguous: bool, 2025-05-07T20:33:31.2075199Z compiled: bool, 2025-05-07T20:33:31.2075426Z ) -> None: 2025-05-07T20:33:31.2075644Z torch.manual_seed(2025) 2025-05-07T20:33:31.2075892Z 2025-05-07T20:33:31.2076180Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2076541Z 2025-05-07T20:33:31.2076744Z x_sign = torch.sign(x) 2025-05-07T20:33:31.2077051Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.2077365Z x = x_sign * x_clamp 2025-05-07T20:33:31.2077627Z x0 = x[:, :D] 2025-05-07T20:33:31.2077854Z x1 = x[:, D:] 2025-05-07T20:33:31.2078078Z 2025-05-07T20:33:31.2078262Z if contiguous: 2025-05-07T20:33:31.2078505Z x0 = x0.contiguous() 2025-05-07T20:33:31.2078779Z x1 = x1.contiguous() 2025-05-07T20:33:31.2079028Z 2025-05-07T20:33:31.2079234Z if scale_ub is not None: 2025-05-07T20:33:31.2079520Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:31.2079863Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:31.2080194Z ) 2025-05-07T20:33:31.2080410Z else: 2025-05-07T20:33:31.2080629Z scale_ub_tensor = None 2025-05-07T20:33:31.2080909Z 2025-05-07T20:33:31.2081151Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.2081471Z op = silu_mul_quant 2025-05-07T20:33:31.2081735Z if compiled: 2025-05-07T20:33:31.2081993Z op = torch.compile(op) 2025-05-07T20:33:31.2082351Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2082644Z 2025-05-07T20:33:31.2082855Z > y_fp8, y_scale = fn() 2025-05-07T20:33:31.2083020Z 2025-05-07T20:33:31.2083136Z moe/activation_test.py:117: 2025-05-07T20:33:31.2083486Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2083836Z moe/activation_test.py:115: in fn 2025-05-07T20:33:31.2084140Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2084908Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:31.2085730Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:31.2086314Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:31.2087051Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:31.2087761Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:31.2088339Z kernel = self.compile( 2025-05-07T20:33:31.2088931Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:31.2089635Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:31.2090065Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2090365Z 2025-05-07T20:33:31.2090586Z self = 2025-05-07T20:33:31.2091735Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:31.2093182Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f09a03b7240>} 2025-05-07T20:33:31.2094708Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:31.2095813Z context = 2025-05-07T20:33:31.2096123Z 2025-05-07T20:33:31.2096314Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:31.2096874Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:31.2097367Z module_map=module_map) 2025-05-07T20:33:31.2097756Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:31.2098135Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:31.2098410Z E ^ 2025-05-07T20:33:31.2098904Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:31.2099388Z 2025-05-07T20:33:31.2099838Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:31.2100387Z 2025-05-07T20:33:31.2100514Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2100947Z self=, 2025-05-07T20:33:31.2101377Z T=4096, 2025-05-07T20:33:31.2101585Z D=7168, 2025-05-07T20:33:31.2101785Z scale_ub=None, 2025-05-07T20:33:31.2102018Z contiguous=False, 2025-05-07T20:33:31.2102250Z compiled=False, 2025-05-07T20:33:31.2102455Z ) 2025-05-07T20:33:31.2102791Z self = 2025-05-07T20:33:31.2103312Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:31.2103601Z 2025-05-07T20:33:31.2103688Z @given( 2025-05-07T20:33:31.2103968Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2104286Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2104643Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2104978Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2105365Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2105666Z ) 2025-05-07T20:33:31.2106020Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2106478Z def test_silu_mul_quant( 2025-05-07T20:33:31.2106761Z self, 2025-05-07T20:33:31.2106950Z T: int, 2025-05-07T20:33:31.2107153Z D: int, 2025-05-07T20:33:31.2107370Z scale_ub: Optional[float], 2025-05-07T20:33:31.2107645Z contiguous: bool, 2025-05-07T20:33:31.2107878Z compiled: bool, 2025-05-07T20:33:31.2108098Z ) -> None: 2025-05-07T20:33:31.2108314Z torch.manual_seed(2025) 2025-05-07T20:33:31.2108551Z 2025-05-07T20:33:31.2108835Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2109192Z 2025-05-07T20:33:31.2109376Z x_sign = torch.sign(x) 2025-05-07T20:33:31.2109665Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.2109984Z x = x_sign * x_clamp 2025-05-07T20:33:31.2110220Z x0 = x[:, :D] 2025-05-07T20:33:31.2110434Z x1 = x[:, D:] 2025-05-07T20:33:31.2110655Z 2025-05-07T20:33:31.2110887Z if contiguous: 2025-05-07T20:33:31.2111128Z x0 = x0.contiguous() 2025-05-07T20:33:31.2111401Z x1 = x1.contiguous() 2025-05-07T20:33:31.2111639Z 2025-05-07T20:33:31.2111842Z if scale_ub is not None: 2025-05-07T20:33:31.2112133Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:31.2112477Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:31.2112802Z ) 2025-05-07T20:33:31.2113006Z else: 2025-05-07T20:33:31.2113233Z scale_ub_tensor = None 2025-05-07T20:33:31.2113491Z 2025-05-07T20:33:31.2113735Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.2114067Z op = silu_mul_quant 2025-05-07T20:33:31.2114318Z if compiled: 2025-05-07T20:33:31.2114575Z op = torch.compile(op) 2025-05-07T20:33:31.2114880Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2115158Z 2025-05-07T20:33:31.2115367Z > y_fp8, y_scale = fn() 2025-05-07T20:33:31.2115540Z 2025-05-07T20:33:31.2115651Z moe/activation_test.py:117: 2025-05-07T20:33:31.2115960Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2116311Z moe/activation_test.py:115: in fn 2025-05-07T20:33:31.2116600Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2117327Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:31.2118056Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:31.2118617Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:31.2119343Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:31.2120042Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:31.2120605Z kernel = self.compile( 2025-05-07T20:33:31.2121174Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:31.2121880Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:31.2122286Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2122538Z 2025-05-07T20:33:31.2122749Z self = 2025-05-07T20:33:31.2123938Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:31.2125590Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f099a879ee0>} 2025-05-07T20:33:31.2127093Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:31.2128269Z context = 2025-05-07T20:33:31.2128577Z 2025-05-07T20:33:31.2128747Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:31.2129292Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:31.2129405Z module_map=module_map) 2025-05-07T20:33:31.2129571Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:31.2129678Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:31.2129756Z E ^ 2025-05-07T20:33:31.2130139Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:31.2130144Z 2025-05-07T20:33:31.2130636Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:31.2130644Z 2025-05-07T20:33:31.2136835Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2137091Z self=, 2025-05-07T20:33:31.2137171Z T=128, 2025-05-07T20:33:31.2137247Z D=7168, 2025-05-07T20:33:31.2137330Z scale_ub=None, 2025-05-07T20:33:31.2137417Z contiguous=False, 2025-05-07T20:33:31.2137506Z compiled=True, 2025-05-07T20:33:31.2137582Z ) 2025-05-07T20:33:31.2137811Z self = 2025-05-07T20:33:31.2137985Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:31.2137990Z 2025-05-07T20:33:31.2138072Z @given( 2025-05-07T20:33:31.2138194Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2138293Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2138412Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2138531Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2138644Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2138719Z ) 2025-05-07T20:33:31.2138971Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2139067Z def test_silu_mul_quant( 2025-05-07T20:33:31.2139145Z self, 2025-05-07T20:33:31.2139227Z T: int, 2025-05-07T20:33:31.2139307Z D: int, 2025-05-07T20:33:31.2139403Z scale_ub: Optional[float], 2025-05-07T20:33:31.2139490Z contiguous: bool, 2025-05-07T20:33:31.2139576Z compiled: bool, 2025-05-07T20:33:31.2139653Z ) -> None: 2025-05-07T20:33:31.2139747Z torch.manual_seed(2025) 2025-05-07T20:33:31.2139825Z 2025-05-07T20:33:31.2139995Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2140068Z 2025-05-07T20:33:31.2140166Z x_sign = torch.sign(x) 2025-05-07T20:33:31.2140292Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.2140386Z x = x_sign * x_clamp 2025-05-07T20:33:31.2140474Z x0 = x[:, :D] 2025-05-07T20:33:31.2140556Z x1 = x[:, D:] 2025-05-07T20:33:31.2140627Z 2025-05-07T20:33:31.2140715Z if contiguous: 2025-05-07T20:33:31.2140804Z x0 = x0.contiguous() 2025-05-07T20:33:31.2140894Z x1 = x1.contiguous() 2025-05-07T20:33:31.2141070Z 2025-05-07T20:33:31.2141160Z if scale_ub is not None: 2025-05-07T20:33:31.2141272Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:31.2141406Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:31.2141481Z ) 2025-05-07T20:33:31.2141558Z else: 2025-05-07T20:33:31.2141713Z scale_ub_tensor = None 2025-05-07T20:33:31.2141785Z 2025-05-07T20:33:31.2141919Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.2142012Z op = silu_mul_quant 2025-05-07T20:33:31.2142137Z if compiled: 2025-05-07T20:33:31.2142244Z op = torch.compile(op) 2025-05-07T20:33:31.2142349Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2142426Z 2025-05-07T20:33:31.2142518Z y_fp8, y_scale = fn() 2025-05-07T20:33:31.2142640Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:31.2142716Z 2025-05-07T20:33:31.2142849Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.2142952Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:31.2143058Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:31.2143176Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:31.2143317Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:31.2143395Z 2025-05-07T20:33:31.2143493Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:31.2143498Z 2025-05-07T20:33:31.2143733Z moe/activation_test.py:126: 2025-05-07T20:33:31.2143869Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2143979Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:31.2144114Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:31.2144754Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:31.2144858Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:31.2145248Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:31.2145475Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:31.2145871Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:31.2146134Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:31.2146532Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:31.2146708Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:31.2147065Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:31.2147141Z fn() 2025-05-07T20:33:31.2147566Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:31.2147651Z self.fn.run( 2025-05-07T20:33:31.2148011Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:31.2148109Z kernel = self.compile( 2025-05-07T20:33:31.2148510Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:31.2148693Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:31.2148825Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2148830Z 2025-05-07T20:33:31.2149040Z self = 2025-05-07T20:33:31.2149855Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:31.2150422Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f099a7f9120>} 2025-05-07T20:33:31.2151280Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:31.2151478Z context = 2025-05-07T20:33:31.2151520Z 2025-05-07T20:33:31.2151695Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:31.2151972Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:31.2152084Z module_map=module_map) 2025-05-07T20:33:31.2152259Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:31.2152368Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:31.2152456Z E ^ 2025-05-07T20:33:31.2152827Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:31.2152832Z 2025-05-07T20:33:31.2153270Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:31.2153275Z 2025-05-07T20:33:31.2153433Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2153662Z self=, 2025-05-07T20:33:31.2153755Z T=128, 2025-05-07T20:33:31.2153837Z D=7168, 2025-05-07T20:33:31.2153923Z scale_ub=None, 2025-05-07T20:33:31.2154021Z contiguous=False, 2025-05-07T20:33:31.2154108Z compiled=False, 2025-05-07T20:33:31.2154186Z ) 2025-05-07T20:33:31.2154416Z self = 2025-05-07T20:33:31.2154598Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:31.2154603Z 2025-05-07T20:33:31.2154684Z @given( 2025-05-07T20:33:31.2154810Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2154912Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2155036Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2155160Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2155279Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2155372Z ) 2025-05-07T20:33:31.2155625Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2155721Z def test_silu_mul_quant( 2025-05-07T20:33:31.2155807Z self, 2025-05-07T20:33:31.2155888Z T: int, 2025-05-07T20:33:31.2155967Z D: int, 2025-05-07T20:33:31.2156075Z scale_ub: Optional[float], 2025-05-07T20:33:31.2156172Z contiguous: bool, 2025-05-07T20:33:31.2156269Z compiled: bool, 2025-05-07T20:33:31.2156354Z ) -> None: 2025-05-07T20:33:31.2156452Z torch.manual_seed(2025) 2025-05-07T20:33:31.2156526Z 2025-05-07T20:33:31.2156704Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2156781Z 2025-05-07T20:33:31.2156885Z x_sign = torch.sign(x) 2025-05-07T20:33:31.2157011Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.2157102Z x = x_sign * x_clamp 2025-05-07T20:33:31.2157187Z x0 = x[:, :D] 2025-05-07T20:33:31.2157271Z x1 = x[:, D:] 2025-05-07T20:33:31.2157346Z 2025-05-07T20:33:31.2157441Z if contiguous: 2025-05-07T20:33:31.2157535Z x0 = x0.contiguous() 2025-05-07T20:33:31.2157629Z x1 = x1.contiguous() 2025-05-07T20:33:31.2157712Z 2025-05-07T20:33:31.2157804Z if scale_ub is not None: 2025-05-07T20:33:31.2157911Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:31.2158099Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:31.2158182Z ) 2025-05-07T20:33:31.2158264Z else: 2025-05-07T20:33:31.2158360Z scale_ub_tensor = None 2025-05-07T20:33:31.2158436Z 2025-05-07T20:33:31.2158617Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.2158711Z op = silu_mul_quant 2025-05-07T20:33:31.2158798Z if compiled: 2025-05-07T20:33:31.2158912Z op = torch.compile(op) 2025-05-07T20:33:31.2159021Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2159139Z 2025-05-07T20:33:31.2159239Z > y_fp8, y_scale = fn() 2025-05-07T20:33:31.2159243Z 2025-05-07T20:33:31.2159342Z moe/activation_test.py:117: 2025-05-07T20:33:31.2159480Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2159589Z moe/activation_test.py:115: in fn 2025-05-07T20:33:31.2159692Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2160223Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:31.2160323Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:31.2160704Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:31.2160936Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:31.2161342Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:31.2161453Z kernel = self.compile( 2025-05-07T20:33:31.2161864Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:31.2162043Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:31.2162184Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2162191Z 2025-05-07T20:33:31.2162408Z self = 2025-05-07T20:33:31.2163234Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:31.2163760Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f099a4b8b80>} 2025-05-07T20:33:31.2164562Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:31.2164770Z context = 2025-05-07T20:33:31.2164774Z 2025-05-07T20:33:31.2164979Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:31.2165282Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:31.2165392Z module_map=module_map) 2025-05-07T20:33:31.2165561Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:31.2165671Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:31.2165752Z E ^ 2025-05-07T20:33:31.2166130Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:31.2166141Z 2025-05-07T20:33:31.2166582Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:31.2166586Z 2025-05-07T20:33:31.2166695Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2166933Z self=, 2025-05-07T20:33:31.2167064Z T=4096, 2025-05-07T20:33:31.2167143Z D=5120, 2025-05-07T20:33:31.2167234Z scale_ub=1200.0, 2025-05-07T20:33:31.2167324Z contiguous=True, 2025-05-07T20:33:31.2167413Z compiled=False, 2025-05-07T20:33:31.2167497Z ) 2025-05-07T20:33:31.2167766Z self = 2025-05-07T20:33:31.2167955Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:31.2167960Z 2025-05-07T20:33:31.2168047Z @given( 2025-05-07T20:33:31.2168168Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2168313Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2168436Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2168557Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2168681Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2168762Z ) 2025-05-07T20:33:31.2169014Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2169121Z def test_silu_mul_quant( 2025-05-07T20:33:31.2169202Z self, 2025-05-07T20:33:31.2169288Z T: int, 2025-05-07T20:33:31.2169370Z D: int, 2025-05-07T20:33:31.2169478Z scale_ub: Optional[float], 2025-05-07T20:33:31.2169580Z contiguous: bool, 2025-05-07T20:33:31.2169670Z compiled: bool, 2025-05-07T20:33:31.2169751Z ) -> None: 2025-05-07T20:33:31.2169855Z torch.manual_seed(2025) 2025-05-07T20:33:31.2169980Z 2025-05-07T20:33:31.2170159Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2170247Z 2025-05-07T20:33:31.2170340Z x_sign = torch.sign(x) 2025-05-07T20:33:31.2170466Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.2170560Z x = x_sign * x_clamp 2025-05-07T20:33:31.2170644Z x0 = x[:, :D] 2025-05-07T20:33:31.2170729Z x1 = x[:, D:] 2025-05-07T20:33:31.2170808Z 2025-05-07T20:33:31.2170894Z if contiguous: 2025-05-07T20:33:31.2170994Z x0 = x0.contiguous() 2025-05-07T20:33:31.2171087Z x1 = x1.contiguous() 2025-05-07T20:33:31.2171158Z 2025-05-07T20:33:31.2171260Z if scale_ub is not None: 2025-05-07T20:33:31.2171373Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:31.2171505Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:31.2171593Z ) 2025-05-07T20:33:31.2171676Z else: 2025-05-07T20:33:31.2171774Z scale_ub_tensor = None 2025-05-07T20:33:31.2171856Z 2025-05-07T20:33:31.2171986Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.2172081Z op = silu_mul_quant 2025-05-07T20:33:31.2172171Z if compiled: 2025-05-07T20:33:31.2172273Z op = torch.compile(op) 2025-05-07T20:33:31.2172390Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2172468Z 2025-05-07T20:33:31.2172563Z > y_fp8, y_scale = fn() 2025-05-07T20:33:31.2172568Z 2025-05-07T20:33:31.2172671Z moe/activation_test.py:117: 2025-05-07T20:33:31.2172803Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2172907Z moe/activation_test.py:115: in fn 2025-05-07T20:33:31.2173018Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2173547Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:31.2173659Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:31.2174040Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:31.2174292Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:31.2174784Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:31.2174879Z kernel = self.compile( 2025-05-07T20:33:31.2175333Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:31.2175513Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:31.2175685Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2175690Z 2025-05-07T20:33:31.2175902Z self = 2025-05-07T20:33:31.2176716Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:31.2177275Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f099a4b9b20>} 2025-05-07T20:33:31.2178073Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:31.2178271Z context = 2025-05-07T20:33:31.2178276Z 2025-05-07T20:33:31.2178449Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:31.2178761Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:31.2178869Z module_map=module_map) 2025-05-07T20:33:31.2179036Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:31.2179138Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:31.2179218Z E ^ 2025-05-07T20:33:31.2179587Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:31.2179592Z 2025-05-07T20:33:31.2180032Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:31.2180037Z 2025-05-07T20:33:31.2180147Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2180379Z self=, 2025-05-07T20:33:31.2180462Z T=1, 2025-05-07T20:33:31.2180540Z D=5120, 2025-05-07T20:33:31.2180625Z scale_ub=None, 2025-05-07T20:33:31.2180720Z contiguous=True, 2025-05-07T20:33:31.2180809Z compiled=True, 2025-05-07T20:33:31.2180887Z ) 2025-05-07T20:33:31.2181119Z self = 2025-05-07T20:33:31.2181284Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:31.2181289Z 2025-05-07T20:33:31.2181369Z @given( 2025-05-07T20:33:31.2181497Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2181600Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2181730Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2181847Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2181959Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2182042Z ) 2025-05-07T20:33:31.2182299Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2182396Z def test_silu_mul_quant( 2025-05-07T20:33:31.2182486Z self, 2025-05-07T20:33:31.2182566Z T: int, 2025-05-07T20:33:31.2182646Z D: int, 2025-05-07T20:33:31.2182755Z scale_ub: Optional[float], 2025-05-07T20:33:31.2182852Z contiguous: bool, 2025-05-07T20:33:31.2182943Z compiled: bool, 2025-05-07T20:33:31.2183029Z ) -> None: 2025-05-07T20:33:31.2183121Z torch.manual_seed(2025) 2025-05-07T20:33:31.2183197Z 2025-05-07T20:33:31.2183367Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2183438Z 2025-05-07T20:33:31.2183606Z x_sign = torch.sign(x) 2025-05-07T20:33:31.2183728Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.2183817Z x = x_sign * x_clamp 2025-05-07T20:33:31.2183897Z x0 = x[:, :D] 2025-05-07T20:33:31.2183972Z x1 = x[:, D:] 2025-05-07T20:33:31.2184044Z 2025-05-07T20:33:31.2184175Z if contiguous: 2025-05-07T20:33:31.2184270Z x0 = x0.contiguous() 2025-05-07T20:33:31.2184355Z x1 = x1.contiguous() 2025-05-07T20:33:31.2184431Z 2025-05-07T20:33:31.2184522Z if scale_ub is not None: 2025-05-07T20:33:31.2184665Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:31.2184805Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:31.2184885Z ) 2025-05-07T20:33:31.2184970Z else: 2025-05-07T20:33:31.2185064Z scale_ub_tensor = None 2025-05-07T20:33:31.2185137Z 2025-05-07T20:33:31.2185269Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.2185357Z op = silu_mul_quant 2025-05-07T20:33:31.2185438Z if compiled: 2025-05-07T20:33:31.2185535Z op = torch.compile(op) 2025-05-07T20:33:31.2185643Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2185713Z 2025-05-07T20:33:31.2185815Z y_fp8, y_scale = fn() 2025-05-07T20:33:31.2185931Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:31.2186003Z 2025-05-07T20:33:31.2186182Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.2186288Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:31.2186396Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:31.2186518Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:31.2186656Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:31.2186739Z 2025-05-07T20:33:31.2186837Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:31.2186844Z 2025-05-07T20:33:31.2186937Z moe/activation_test.py:126: 2025-05-07T20:33:31.2187074Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2187175Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:31.2187313Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:31.2187905Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:31.2188011Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:31.2188405Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:31.2188632Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:31.2189015Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:31.2189283Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:31.2189678Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:31.2189849Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:31.2190207Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:31.2190278Z fn() 2025-05-07T20:33:31.2190707Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:31.2190787Z self.fn.run( 2025-05-07T20:33:31.2191149Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:31.2191240Z kernel = self.compile( 2025-05-07T20:33:31.2191639Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:31.2191877Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:31.2192008Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2192013Z 2025-05-07T20:33:31.2192225Z self = 2025-05-07T20:33:31.2193088Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:31.2193640Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f099a4baca0>} 2025-05-07T20:33:31.2194439Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:31.2194635Z context = 2025-05-07T20:33:31.2194639Z 2025-05-07T20:33:31.2194817Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:31.2195096Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:31.2195205Z module_map=module_map) 2025-05-07T20:33:31.2195411Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:31.2195520Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:31.2195603Z E ^ 2025-05-07T20:33:31.2195983Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:31.2195987Z 2025-05-07T20:33:31.2196425Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:31.2196433Z 2025-05-07T20:33:31.2196542Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2196771Z self=, 2025-05-07T20:33:31.2196850Z T=2048, 2025-05-07T20:33:31.2196934Z D=5120, 2025-05-07T20:33:31.2197019Z scale_ub=None, 2025-05-07T20:33:31.2197111Z contiguous=True, 2025-05-07T20:33:31.2197201Z compiled=True, 2025-05-07T20:33:31.2197276Z ) 2025-05-07T20:33:31.2197509Z self = 2025-05-07T20:33:31.2197688Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:31.2197697Z 2025-05-07T20:33:31.2197775Z @given( 2025-05-07T20:33:31.2197897Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2198007Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2198123Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2198248Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2198369Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2198449Z ) 2025-05-07T20:33:31.2198702Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2198805Z def test_silu_mul_quant( 2025-05-07T20:33:31.2198888Z self, 2025-05-07T20:33:31.2198970Z T: int, 2025-05-07T20:33:31.2199050Z D: int, 2025-05-07T20:33:31.2199151Z scale_ub: Optional[float], 2025-05-07T20:33:31.2199245Z contiguous: bool, 2025-05-07T20:33:31.2199335Z compiled: bool, 2025-05-07T20:33:31.2199423Z ) -> None: 2025-05-07T20:33:31.2199516Z torch.manual_seed(2025) 2025-05-07T20:33:31.2199592Z 2025-05-07T20:33:31.2199760Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2199830Z 2025-05-07T20:33:31.2199923Z x_sign = torch.sign(x) 2025-05-07T20:33:31.2200043Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.2200178Z x = x_sign * x_clamp 2025-05-07T20:33:31.2200255Z x0 = x[:, :D] 2025-05-07T20:33:31.2200329Z x1 = x[:, D:] 2025-05-07T20:33:31.2200405Z 2025-05-07T20:33:31.2200491Z if contiguous: 2025-05-07T20:33:31.2200578Z x0 = x0.contiguous() 2025-05-07T20:33:31.2200704Z x1 = x1.contiguous() 2025-05-07T20:33:31.2200784Z 2025-05-07T20:33:31.2200873Z if scale_ub is not None: 2025-05-07T20:33:31.2200976Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:31.2201115Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:31.2201234Z ) 2025-05-07T20:33:31.2201312Z else: 2025-05-07T20:33:31.2201405Z scale_ub_tensor = None 2025-05-07T20:33:31.2201478Z 2025-05-07T20:33:31.2201608Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.2201699Z op = silu_mul_quant 2025-05-07T20:33:31.2201790Z if compiled: 2025-05-07T20:33:31.2201890Z op = torch.compile(op) 2025-05-07T20:33:31.2201994Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2202071Z 2025-05-07T20:33:31.2202162Z y_fp8, y_scale = fn() 2025-05-07T20:33:31.2202281Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:31.2202354Z 2025-05-07T20:33:31.2202502Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.2202606Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:31.2202760Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:31.2202884Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:31.2203031Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:31.2203110Z 2025-05-07T20:33:31.2203207Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:31.2203211Z 2025-05-07T20:33:31.2203306Z moe/activation_test.py:126: 2025-05-07T20:33:31.2203444Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2203553Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:31.2203697Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:31.2204304Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:31.2204418Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:31.2204828Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:31.2205053Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:31.2205445Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:31.2205711Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:31.2206104Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:31.2206279Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:31.2206642Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:31.2206718Z fn() 2025-05-07T20:33:31.2207150Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:31.2207234Z self.fn.run( 2025-05-07T20:33:31.2207598Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:31.2207688Z kernel = self.compile( 2025-05-07T20:33:31.2208087Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:31.2208271Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:31.2208400Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2208450Z 2025-05-07T20:33:31.2208657Z self = 2025-05-07T20:33:31.2209517Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:31.2210035Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f099a565e40>} 2025-05-07T20:33:31.2210876Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:31.2211069Z context = 2025-05-07T20:33:31.2211076Z 2025-05-07T20:33:31.2211261Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:31.2211538Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:31.2211646Z module_map=module_map) 2025-05-07T20:33:31.2211824Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:31.2211926Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:31.2212002Z E ^ 2025-05-07T20:33:31.2212453Z E ValueError("type fp8e4nv not supported in this architecture. 
Trying example: test_silu_mul_quant(
    self=<...>,
    T=128,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
self = <...>
T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126: CompilationError (traceback identical to the one above: ref_fn -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row fails to build for fp8e4nv)

Hypothesis then retried the test with further examples; each retry prints the same test body and an identical traceback, so the remaining examples are consolidated below by failure site. These examples likewise fail inside ref_fn() at moe/activation_test.py:126 (_kernel_quantize_fp8_row):

Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=None, contiguous=True,  compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True,  compiled=True)
Trying example: test_silu_mul_quant(T=1,     D=5120, scale_ub=None, contiguous=False, compiled=True)
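For reference, the contract the failing ref_fn path relies on: triton_quantize_fp8_row quantizes each row of y to FP8 with one float32 scale per row, optionally clamping the per-row maximum by scale_ub, so that callers can dequantize as the test does with y_fp8.to(torch.float32) * y_scale[:, None]. A minimal eager sketch of those semantics follows; the function name, the eps guard, and the 448.0 E4M3 maximum are assumptions, and this is an illustration of the contract, not FBGEMM's kernel:

from typing import Optional, Tuple

import torch

FP8_E4M3_MAX = 448.0  # assumed max finite magnitude of torch.float8_e4m3fn


def quantize_fp8_row_sketch(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # One scale per row, derived from the row's max absolute value.
    row_max = y.abs().amax(dim=-1, keepdim=True).to(torch.float32)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    row_max = row_max.clamp(min=1e-12)  # guard all-zero rows
    scale = FP8_E4M3_MAX / row_max
    y_fp8 = (y.to(torch.float32) * scale).to(torch.float8_e4m3fn)
    # Return the inverse scale so dequantization matches the test:
    # y ~= y_fp8.to(torch.float32) * y_scale[:, None]
    return y_fp8, (1.0 / scale).squeeze(-1)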
y_scale_ref = ref_fn() 2025-05-07T20:33:31.2287284Z 2025-05-07T20:33:31.2287381Z moe/activation_test.py:126: 2025-05-07T20:33:31.2287514Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2287625Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:31.2287808Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:31.2288399Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:31.2288506Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:31.2288925Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:31.2289169Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:31.2289604Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:31.2289872Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:31.2290277Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:31.2290454Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:31.2290823Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:31.2290897Z fn() 2025-05-07T20:33:31.2291323Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:31.2291415Z self.fn.run( 2025-05-07T20:33:31.2291811Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:31.2291914Z kernel = self.compile( 2025-05-07T20:33:31.2292325Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:31.2292511Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:31.2292651Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2292656Z 2025-05-07T20:33:31.2292872Z self = 2025-05-07T20:33:31.2293698Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:31.2294227Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e500eb60>} 2025-05-07T20:33:31.2295129Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:31.2295360Z context = 2025-05-07T20:33:31.2295365Z 2025-05-07T20:33:31.2295535Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:31.2295814Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:31.2295926Z module_map=module_map) 2025-05-07T20:33:31.2296088Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:31.2296201Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:31.2296280Z E ^ 2025-05-07T20:33:31.2296655Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:31.2296660Z 2025-05-07T20:33:31.2297105Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:31.2297110Z 2025-05-07T20:33:31.2297217Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2297449Z self=, 2025-05-07T20:33:31.2297525Z T=1, 2025-05-07T20:33:31.2297603Z D=5120, 2025-05-07T20:33:31.2297735Z scale_ub=None, 2025-05-07T20:33:31.2297819Z contiguous=True, 2025-05-07T20:33:31.2297904Z compiled=False, 2025-05-07T20:33:31.2297989Z ) 2025-05-07T20:33:31.2298214Z self = 2025-05-07T20:33:31.2298419Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:31.2298424Z 2025-05-07T20:33:31.2298507Z @given( 2025-05-07T20:33:31.2298632Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2298737Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2298895Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2299014Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2299134Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2299212Z ) 2025-05-07T20:33:31.2299471Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2299572Z def test_silu_mul_quant( 2025-05-07T20:33:31.2299650Z self, 2025-05-07T20:33:31.2299729Z T: int, 2025-05-07T20:33:31.2299808Z D: int, 2025-05-07T20:33:31.2299908Z scale_ub: Optional[float], 2025-05-07T20:33:31.2300003Z contiguous: bool, 2025-05-07T20:33:31.2300093Z compiled: bool, 2025-05-07T20:33:31.2300179Z ) -> None: 2025-05-07T20:33:31.2300277Z torch.manual_seed(2025) 2025-05-07T20:33:31.2300349Z 2025-05-07T20:33:31.2300566Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2300651Z 2025-05-07T20:33:31.2300743Z x_sign = torch.sign(x) 2025-05-07T20:33:31.2300868Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.2300961Z x = x_sign * x_clamp 2025-05-07T20:33:31.2301040Z x0 = x[:, :D] 2025-05-07T20:33:31.2301125Z x1 = x[:, D:] 2025-05-07T20:33:31.2301207Z 2025-05-07T20:33:31.2301290Z if contiguous: 2025-05-07T20:33:31.2301386Z x0 = x0.contiguous() 2025-05-07T20:33:31.2301486Z x1 = x1.contiguous() 2025-05-07T20:33:31.2301560Z 2025-05-07T20:33:31.2301656Z if scale_ub is not None: 2025-05-07T20:33:31.2301760Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:31.2301903Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:31.2301981Z ) 2025-05-07T20:33:31.2302059Z else: 2025-05-07T20:33:31.2302154Z scale_ub_tensor = None 2025-05-07T20:33:31.2302234Z 2025-05-07T20:33:31.2302365Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.2302464Z op = silu_mul_quant 2025-05-07T20:33:31.2302550Z if compiled: 2025-05-07T20:33:31.2302650Z op = torch.compile(op) 2025-05-07T20:33:31.2302756Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2302836Z 2025-05-07T20:33:31.2302932Z > y_fp8, y_scale = fn() 2025-05-07T20:33:31.2302936Z 2025-05-07T20:33:31.2303041Z moe/activation_test.py:117: 2025-05-07T20:33:31.2303180Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2303281Z moe/activation_test.py:115: in fn 2025-05-07T20:33:31.2303384Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2303914Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:31.2304012Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:31.2304407Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:31.2304684Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:31.2305046Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:31.2305140Z kernel = self.compile( 2025-05-07T20:33:31.2305545Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:31.2305779Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:31.2305910Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2305915Z 2025-05-07T20:33:31.2306169Z self = 2025-05-07T20:33:31.2306989Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:31.2307545Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e500f9c0>} 2025-05-07T20:33:31.2308345Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:31.2308548Z context = 2025-05-07T20:33:31.2308553Z 2025-05-07T20:33:31.2308728Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:31.2309000Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:31.2309147Z module_map=module_map) 2025-05-07T20:33:31.2309314Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:31.2309420Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:31.2309496Z E ^ 2025-05-07T20:33:31.2309874Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:31.2309879Z 2025-05-07T20:33:31.2310315Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:31.2310322Z 2025-05-07T20:33:31.2310431Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2310662Z self=, 2025-05-07T20:33:31.2310741Z T=128, 2025-05-07T20:33:31.2310822Z D=5120, 2025-05-07T20:33:31.2310907Z scale_ub=None, 2025-05-07T20:33:31.2310998Z contiguous=False, 2025-05-07T20:33:31.2311083Z compiled=True, 2025-05-07T20:33:31.2311160Z ) 2025-05-07T20:33:31.2311391Z self = 2025-05-07T20:33:31.2311569Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:31.2311573Z 2025-05-07T20:33:31.2311657Z @given( 2025-05-07T20:33:31.2311780Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2311881Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2311997Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2312127Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2312244Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2312328Z ) 2025-05-07T20:33:31.2312582Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2312680Z def test_silu_mul_quant( 2025-05-07T20:33:31.2312758Z self, 2025-05-07T20:33:31.2312841Z T: int, 2025-05-07T20:33:31.2312919Z D: int, 2025-05-07T20:33:31.2313026Z scale_ub: Optional[float], 2025-05-07T20:33:31.2313119Z contiguous: bool, 2025-05-07T20:33:31.2313210Z compiled: bool, 2025-05-07T20:33:31.2313300Z ) -> None: 2025-05-07T20:33:31.2313396Z torch.manual_seed(2025) 2025-05-07T20:33:31.2313473Z 2025-05-07T20:33:31.2313654Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2313728Z 2025-05-07T20:33:31.2313826Z x_sign = torch.sign(x) 2025-05-07T20:33:31.2314025Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.2314114Z x = x_sign * x_clamp 2025-05-07T20:33:31.2314201Z x0 = x[:, :D] 2025-05-07T20:33:31.2314278Z x1 = x[:, D:] 2025-05-07T20:33:31.2314354Z 2025-05-07T20:33:31.2314441Z if contiguous: 2025-05-07T20:33:31.2314573Z x0 = x0.contiguous() 2025-05-07T20:33:31.2314671Z x1 = x1.contiguous() 2025-05-07T20:33:31.2314768Z 2025-05-07T20:33:31.2314872Z if scale_ub is not None: 2025-05-07T20:33:31.2315005Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:31.2315185Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:31.2315259Z ) 2025-05-07T20:33:31.2315341Z else: 2025-05-07T20:33:31.2315441Z scale_ub_tensor = None 2025-05-07T20:33:31.2315516Z 2025-05-07T20:33:31.2315653Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.2315743Z op = silu_mul_quant 2025-05-07T20:33:31.2315832Z if compiled: 2025-05-07T20:33:31.2315933Z op = torch.compile(op) 2025-05-07T20:33:31.2316039Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2316111Z 2025-05-07T20:33:31.2316208Z > y_fp8, y_scale = fn() 2025-05-07T20:33:31.2316213Z 2025-05-07T20:33:31.2316315Z moe/activation_test.py:117: 2025-05-07T20:33:31.2316448Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2316594Z moe/activation_test.py:115: in fn 2025-05-07T20:33:31.2316695Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2317089Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:31.2317188Z return fn(*args, **kwargs) 
2025-05-07T20:33:31.2317716Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:31.2317824Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:31.2318201Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:31.2318431Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:31.2318803Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:31.2318900Z kernel = self.compile( 2025-05-07T20:33:31.2319312Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:31.2319498Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:31.2319630Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2319634Z 2025-05-07T20:33:31.2319850Z self = 2025-05-07T20:33:31.2320663Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:31.2321187Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e500ca40>} 2025-05-07T20:33:31.2321986Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:31.2322185Z context = 2025-05-07T20:33:31.2322190Z 2025-05-07T20:33:31.2322365Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:31.2322638Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:31.2322788Z module_map=module_map) 2025-05-07T20:33:31.2322950Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:31.2323051Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:31.2323134Z E ^ 2025-05-07T20:33:31.2323544Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:31.2323549Z 2025-05-07T20:33:31.2324000Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:31.2324044Z 2025-05-07T20:33:31.2324151Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2324384Z self=, 2025-05-07T20:33:31.2324469Z T=128, 2025-05-07T20:33:31.2324548Z D=7168, 2025-05-07T20:33:31.2324630Z scale_ub=1200.0, 2025-05-07T20:33:31.2324714Z contiguous=False, 2025-05-07T20:33:31.2324813Z compiled=False, 2025-05-07T20:33:31.2324890Z ) 2025-05-07T20:33:31.2325144Z self = 2025-05-07T20:33:31.2325319Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:31.2325324Z 2025-05-07T20:33:31.2325589Z @given( 2025-05-07T20:33:31.2325769Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2325906Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2326118Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2326246Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2326373Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2326449Z ) 2025-05-07T20:33:31.2326737Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2326838Z def test_silu_mul_quant( 2025-05-07T20:33:31.2326918Z self, 2025-05-07T20:33:31.2326995Z T: int, 2025-05-07T20:33:31.2327077Z D: int, 2025-05-07T20:33:31.2327173Z scale_ub: Optional[float], 2025-05-07T20:33:31.2327259Z contiguous: bool, 2025-05-07T20:33:31.2327344Z compiled: bool, 2025-05-07T20:33:31.2327421Z ) -> None: 2025-05-07T20:33:31.2327513Z torch.manual_seed(2025) 2025-05-07T20:33:31.2327586Z 2025-05-07T20:33:31.2327762Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2327834Z 2025-05-07T20:33:31.2327927Z x_sign = torch.sign(x) 2025-05-07T20:33:31.2328053Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.2328146Z x = x_sign * x_clamp 2025-05-07T20:33:31.2328222Z x0 = x[:, :D] 2025-05-07T20:33:31.2328301Z x1 = x[:, D:] 2025-05-07T20:33:31.2328374Z 2025-05-07T20:33:31.2328455Z if contiguous: 2025-05-07T20:33:31.2328542Z x0 = x0.contiguous() 2025-05-07T20:33:31.2328630Z x1 = x1.contiguous() 2025-05-07T20:33:31.2328700Z 2025-05-07T20:33:31.2328793Z if scale_ub is not None: 2025-05-07T20:33:31.2328896Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:31.2329026Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:31.2329103Z ) 2025-05-07T20:33:31.2329174Z else: 2025-05-07T20:33:31.2329267Z scale_ub_tensor = None 2025-05-07T20:33:31.2329343Z 2025-05-07T20:33:31.2329470Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.2329559Z op = silu_mul_quant 2025-05-07T20:33:31.2329641Z if compiled: 2025-05-07T20:33:31.2329739Z op = torch.compile(op) 2025-05-07T20:33:31.2329842Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2329913Z 2025-05-07T20:33:31.2330000Z > y_fp8, y_scale = fn() 2025-05-07T20:33:31.2330004Z 2025-05-07T20:33:31.2330098Z moe/activation_test.py:117: 2025-05-07T20:33:31.2330231Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2330397Z moe/activation_test.py:115: in fn 2025-05-07T20:33:31.2330497Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2331020Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:31.2331174Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:31.2331555Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:31.2331784Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:31.2332204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:31.2332299Z kernel = self.compile( 2025-05-07T20:33:31.2332703Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:31.2332883Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:31.2333015Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2333020Z 2025-05-07T20:33:31.2333232Z self = 2025-05-07T20:33:31.2334091Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:31.2334674Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e5d34540>} 2025-05-07T20:33:31.2335471Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:31.2335673Z context = 2025-05-07T20:33:31.2335677Z 2025-05-07T20:33:31.2335854Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:31.2336130Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:31.2336239Z module_map=module_map) 2025-05-07T20:33:31.2336406Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:31.2336510Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:31.2336591Z E ^ 2025-05-07T20:33:31.2336965Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:31.2336969Z 2025-05-07T20:33:31.2337406Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:31.2337410Z 2025-05-07T20:33:31.2337519Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2337753Z self=, 2025-05-07T20:33:31.2337835Z T=128, 2025-05-07T20:33:31.2337916Z D=5120, 2025-05-07T20:33:31.2338000Z scale_ub=None, 2025-05-07T20:33:31.2338087Z contiguous=False, 2025-05-07T20:33:31.2338175Z compiled=False, 2025-05-07T20:33:31.2338250Z ) 2025-05-07T20:33:31.2338482Z self = 2025-05-07T20:33:31.2338660Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:31.2338666Z 2025-05-07T20:33:31.2338747Z @given( 2025-05-07T20:33:31.2338867Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2338965Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2339078Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2339195Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2339306Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2339425Z ) 2025-05-07T20:33:31.2339680Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2339772Z def test_silu_mul_quant( 2025-05-07T20:33:31.2339850Z self, 2025-05-07T20:33:31.2339924Z T: int, 2025-05-07T20:33:31.2340038Z D: int, 2025-05-07T20:33:31.2340137Z scale_ub: Optional[float], 2025-05-07T20:33:31.2340222Z contiguous: bool, 2025-05-07T20:33:31.2340305Z compiled: bool, 2025-05-07T20:33:31.2340386Z ) -> None: 2025-05-07T20:33:31.2340517Z torch.manual_seed(2025) 2025-05-07T20:33:31.2340588Z 2025-05-07T20:33:31.2340762Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2340834Z 2025-05-07T20:33:31.2340922Z x_sign = torch.sign(x) 2025-05-07T20:33:31.2341046Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.2341129Z x = x_sign * x_clamp 2025-05-07T20:33:31.2341215Z x0 = x[:, :D] 2025-05-07T20:33:31.2341290Z x1 = x[:, D:] 2025-05-07T20:33:31.2341357Z 2025-05-07T20:33:31.2341441Z if contiguous: 2025-05-07T20:33:31.2341529Z x0 = x0.contiguous() 2025-05-07T20:33:31.2341614Z x1 = x1.contiguous() 2025-05-07T20:33:31.2341689Z 2025-05-07T20:33:31.2341777Z if scale_ub is not None: 2025-05-07T20:33:31.2341877Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:31.2342077Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:31.2342154Z ) 2025-05-07T20:33:31.2342228Z else: 2025-05-07T20:33:31.2342326Z scale_ub_tensor = None 2025-05-07T20:33:31.2342394Z 2025-05-07T20:33:31.2342523Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.2342614Z op = silu_mul_quant 2025-05-07T20:33:31.2342694Z if compiled: 2025-05-07T20:33:31.2342794Z op = torch.compile(op) 2025-05-07T20:33:31.2342898Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2342968Z 2025-05-07T20:33:31.2343058Z > y_fp8, y_scale = fn() 2025-05-07T20:33:31.2343062Z 2025-05-07T20:33:31.2343155Z moe/activation_test.py:117: 2025-05-07T20:33:31.2343289Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2343389Z moe/activation_test.py:115: in fn 2025-05-07T20:33:31.2343487Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2344015Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:31.2344114Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:31.2344487Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:31.2344718Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:31.2345074Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:31.2345166Z kernel = self.compile( 2025-05-07T20:33:31.2345573Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:31.2345749Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:31.2345879Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2345889Z 2025-05-07T20:33:31.2346093Z self = 2025-05-07T20:33:31.2346904Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:31.2347419Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e50c0400>} 2025-05-07T20:33:31.2348255Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:31.2348489Z context = 2025-05-07T20:33:31.2348494Z 2025-05-07T20:33:31.2348667Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:31.2348939Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:31.2349089Z module_map=module_map) 2025-05-07T20:33:31.2349252Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:31.2349358Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:31.2349434Z E ^ 2025-05-07T20:33:31.2349805Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:31.2349812Z 
2025-05-07T20:33:31.2350250Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:31.2350254Z 
2025-05-07T20:33:31.2350360Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:31.2350593Z     self=,
2025-05-07T20:33:31.2350670Z     T=128,
2025-05-07T20:33:31.2350792Z     D=5120,
2025-05-07T20:33:31.2350882Z     scale_ub=1200.0,
2025-05-07T20:33:31.2350970Z     contiguous=True,
2025-05-07T20:33:31.2351056Z     compiled=False,
2025-05-07T20:33:31.2351132Z )
2025-05-07T20:33:31.2351355Z self = 
2025-05-07T20:33:31.2351528Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False
2025-05-07T20:33:31.2351532Z 
2025-05-07T20:33:31.2351618Z     @given(
2025-05-07T20:33:31.2351738Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:33:31.2351841Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:33:31.2351953Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:33:31.2352071Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:33:31.2352190Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:33:31.2352263Z     )
2025-05-07T20:33:31.2352514Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:33:31.2352614Z     def test_silu_mul_quant(
2025-05-07T20:33:31.2352696Z         self,
2025-05-07T20:33:31.2352773Z         T: int,
2025-05-07T20:33:31.2352855Z         D: int,
2025-05-07T20:33:31.2352955Z         scale_ub: Optional[float],
2025-05-07T20:33:31.2353044Z         contiguous: bool,
2025-05-07T20:33:31.2353131Z         compiled: bool,
2025-05-07T20:33:31.2353208Z     ) -> None:
2025-05-07T20:33:31.2353302Z         torch.manual_seed(2025)
2025-05-07T20:33:31.2353373Z 
2025-05-07T20:33:31.2353540Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:31.2353618Z 
2025-05-07T20:33:31.2353704Z         x_sign = torch.sign(x)
2025-05-07T20:33:31.2353824Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:33:31.2353916Z         x = x_sign * x_clamp
2025-05-07T20:33:31.2353994Z         x0 = x[:, :D]
2025-05-07T20:33:31.2354067Z         x1 = x[:, D:]
2025-05-07T20:33:31.2354143Z 
2025-05-07T20:33:31.2354226Z         if contiguous:
2025-05-07T20:33:31.2354315Z             x0 = x0.contiguous()
2025-05-07T20:33:31.2354405Z             x1 = x1.contiguous()
2025-05-07T20:33:31.2354475Z 
2025-05-07T20:33:31.2354564Z         if scale_ub is not None:
2025-05-07T20:33:31.2354666Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:33:31.2354812Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:33:31.2354899Z             )
2025-05-07T20:33:31.2354986Z         else:
2025-05-07T20:33:31.2355135Z             scale_ub_tensor = None
2025-05-07T20:33:31.2355208Z 
2025-05-07T20:33:31.2355334Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:31.2355420Z             op = silu_mul_quant
2025-05-07T20:33:31.2355503Z             if compiled:
2025-05-07T20:33:31.2355642Z                 op = torch.compile(op)
2025-05-07T20:33:31.2355746Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:31.2355819Z 
2025-05-07T20:33:31.2355907Z >       y_fp8, y_scale = fn()
2025-05-07T20:33:31.2355913Z 
2025-05-07T20:33:31.2356010Z moe/activation_test.py:117: 
2025-05-07T20:33:31.2356183Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:31.2356280Z moe/activation_test.py:115: in fn
2025-05-07T20:33:31.2356381Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:31.2356906Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:33:31.2357003Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:33:31.2357380Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:31.2357606Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:31.2357967Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:31.2358058Z     kernel = self.compile(
2025-05-07T20:33:31.2358498Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:31.2358681Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:31.2358810Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:31.2358814Z 
2025-05-07T20:33:31.2359023Z self = 
2025-05-07T20:33:31.2359832Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:31.2360346Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e50c1300>}
2025-05-07T20:33:31.2361143Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:33:31.2361337Z context = 
2025-05-07T20:33:31.2361341Z 
2025-05-07T20:33:31.2361511Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:31.2361779Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:31.2361882Z                            module_map=module_map)
2025-05-07T20:33:31.2362044Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:31.2362139Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:33:31.2362214Z E       ^
2025-05-07T20:33:31.2362584Z E       ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:31.2362589Z 
2025-05-07T20:33:31.2363026Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:31.2363033Z 
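[editor's note] The ValueError above comes from Triton's NVIDIA backend: fp8e4nv (FP8 E4M3) is only lowered on GPUs with compute capability 8.9 or newer (Ada/Hopper), and the supported list in the message -- just fp8e4b15 and fp8e5 -- is what Triton reports on older Ampere-class parts. A minimal guard sketch under that assumption is shown below; the helper name, the test-class name, and the skip message are illustrative, not FBGEMM's actual test plumbing.

    # Sketch only: skip fp8e4nv-dependent tests on GPUs that cannot compile them.
    # Assumption (see note above): Triton lowers fp8e4nv only for compute
    # capability >= 8.9; older parts expose just fp8e4b15/fp8e5, which is
    # exactly the ValueError seen in this log.
    import unittest

    import torch


    def cuda_supports_fp8e4nv() -> bool:
        # No CUDA device at all -> nothing fp8-related can run.
        if not torch.cuda.is_available():
            return False
        major, minor = torch.cuda.get_device_capability()
        return (major, minor) >= (8, 9)


    @unittest.skipIf(not cuda_supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
    class ActivationTests(unittest.TestCase):
        ...  # test_silu_mul_quant and friends would live here

With such a guard the whole class would report as skipped on this runner instead of burning through every Hypothesis example and failing each one with the same compilation error.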
[log condensed here: Hypothesis kept retrying the same failure. Each of the examples below ran the identical test body shown above and failed with the same CompilationError from _fbgemm_silu_mul_quant -- ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") at triton/compiler/compiler.py:100 -- with the compiled=True runs additionally passing through torch/_dynamo/eval_frame.py:678 before reaching activation.py:80:]

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)

[one deviation: in the (T=1, D=7168, scale_ub=None, contiguous=False, compiled=True) example the call y_fp8, y_scale = fn() returned, and the test instead failed at y_fp8_ref, y_scale_ref = ref_fn() (moe/activation_test.py:126), where ref_fn's call to triton_quantize_fp8_row (fp8_gemm.py:2370) launched _kernel_quantize_fp8_row and hit the same ValueError during autotuner benchmarking (triton/runtime/autotuner.py:186 -> triton/testing.py:117 -> triton/runtime/jit.py:623).]
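[editor's note] For readers reconstructing what the test checks: fn() calls the fused Triton kernel, while ref_fn() computes SiLU(x0) * x1 in fp32 and row-wise FP8-quantizes it via triton_quantize_fp8_row. A dependency-free sketch of that reference math follows, assuming torch.float8_e4m3fn is available (PyTorch >= 2.1); the function name and the E4M3 max constant of 448 are assumptions, and the exact clamping/rounding inside triton_quantize_fp8_row may differ.

    from typing import Optional, Tuple

    import torch

    FP8_E4M3_MAX = 448.0  # largest finite float8_e4m3fn value (assumption noted above)


    def silu_mul_quant_ref(
        x0: torch.Tensor,                         # [T, D] gate input
        x1: torch.Tensor,                         # [T, D] up-projection input
        scale_ub: Optional[torch.Tensor] = None,  # optional [1] fp32 upper bound on row scales
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        """Row-wise FP8 quantization of SiLU(x0) * x1, mirroring ref_fn in the log."""
        y = x0.float() * torch.sigmoid(x0.float()) * x1.float()  # SiLU(x0) * x1 in fp32
        row_max = y.abs().amax(dim=1)                            # per-row absolute max
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub.to(row_max.dtype))
        scale = row_max.clamp(min=1e-12) / FP8_E4M3_MAX          # avoid divide-by-zero
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

Dequantizing with y_fp8.to(torch.float32) * scale[:, None], as the test itself does right after fn(), is what would let the fused and reference outputs be compared once the kernels actually compile on the target GPU.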
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:31.2503408Z 2025-05-07T20:33:31.2503883Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:31.2503894Z 2025-05-07T20:33:31.2503998Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2504227Z self=, 2025-05-07T20:33:31.2504311Z T=4096, 2025-05-07T20:33:31.2504388Z D=5120, 2025-05-07T20:33:31.2504473Z scale_ub=None, 2025-05-07T20:33:31.2504565Z contiguous=False, 2025-05-07T20:33:31.2504650Z compiled=True, 2025-05-07T20:33:31.2504721Z ) 2025-05-07T20:33:31.2504950Z self = 2025-05-07T20:33:31.2505129Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:31.2505134Z 2025-05-07T20:33:31.2505212Z @given( 2025-05-07T20:33:31.2505338Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2505441Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2505560Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2505680Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2505793Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2505872Z ) 2025-05-07T20:33:31.2506127Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2506219Z def test_silu_mul_quant( 2025-05-07T20:33:31.2506303Z self, 2025-05-07T20:33:31.2506385Z T: int, 2025-05-07T20:33:31.2506459Z D: int, 2025-05-07T20:33:31.2506561Z scale_ub: Optional[float], 2025-05-07T20:33:31.2506651Z contiguous: bool, 2025-05-07T20:33:31.2506740Z compiled: bool, 2025-05-07T20:33:31.2506814Z ) -> None: 2025-05-07T20:33:31.2506913Z torch.manual_seed(2025) 2025-05-07T20:33:31.2506990Z 2025-05-07T20:33:31.2507162Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2507240Z 2025-05-07T20:33:31.2507335Z x_sign = torch.sign(x) 2025-05-07T20:33:31.2507467Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.2507555Z x = x_sign * x_clamp 2025-05-07T20:33:31.2507639Z x0 = x[:, :D] 2025-05-07T20:33:31.2507717Z x1 = x[:, D:] 2025-05-07T20:33:31.2507793Z 2025-05-07T20:33:31.2507882Z if contiguous: 2025-05-07T20:33:31.2507975Z x0 = x0.contiguous() 2025-05-07T20:33:31.2508065Z x1 = x1.contiguous() 2025-05-07T20:33:31.2508193Z 2025-05-07T20:33:31.2508290Z if scale_ub is not None: 2025-05-07T20:33:31.2508399Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:31.2508534Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:31.2508605Z ) 2025-05-07T20:33:31.2508723Z else: 2025-05-07T20:33:31.2508816Z scale_ub_tensor = None 2025-05-07T20:33:31.2508888Z 2025-05-07T20:33:31.2509020Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.2509107Z op = silu_mul_quant 2025-05-07T20:33:31.2509230Z if compiled: 2025-05-07T20:33:31.2509328Z op = torch.compile(op) 2025-05-07T20:33:31.2509434Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2509507Z 2025-05-07T20:33:31.2509599Z > y_fp8, y_scale = fn() 2025-05-07T20:33:31.2509604Z 2025-05-07T20:33:31.2509696Z moe/activation_test.py:117: 2025-05-07T20:33:31.2509837Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2509934Z moe/activation_test.py:115: in fn 2025-05-07T20:33:31.2510031Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2510422Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:31.2510513Z return fn(*args, **kwargs) 
2025-05-07T20:33:31.2511073Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:31.2511180Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:31.2511554Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:31.2511785Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:31.2512140Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:31.2512234Z kernel = self.compile( 2025-05-07T20:33:31.2512639Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:31.2512816Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:31.2512950Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2512955Z 2025-05-07T20:33:31.2513162Z self = 2025-05-07T20:33:31.2513975Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:31.2514493Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e46a91c0>} 2025-05-07T20:33:31.2515289Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:31.2515485Z context = 2025-05-07T20:33:31.2515490Z 2025-05-07T20:33:31.2515656Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:31.2515927Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:31.2516037Z module_map=module_map) 2025-05-07T20:33:31.2516196Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:31.2516298Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:31.2516370Z E ^ 2025-05-07T20:33:31.2516736Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:31.2516785Z 2025-05-07T20:33:31.2517227Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:31.2517232Z 2025-05-07T20:33:31.2517335Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2521478Z self=, 2025-05-07T20:33:31.2521576Z T=4096, 2025-05-07T20:33:31.2521656Z D=5120, 2025-05-07T20:33:31.2521743Z scale_ub=1200.0, 2025-05-07T20:33:31.2521838Z contiguous=False, 2025-05-07T20:33:31.2521921Z compiled=False, 2025-05-07T20:33:31.2522042Z ) 2025-05-07T20:33:31.2522282Z self = 2025-05-07T20:33:31.2522471Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:31.2522476Z 2025-05-07T20:33:31.2522564Z @given( 2025-05-07T20:33:31.2522685Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2522792Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2522918Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2523036Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2523151Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2523235Z ) 2025-05-07T20:33:31.2523492Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2523598Z def test_silu_mul_quant( 2025-05-07T20:33:31.2523719Z self, 2025-05-07T20:33:31.2523798Z T: int, 2025-05-07T20:33:31.2523885Z D: int, 2025-05-07T20:33:31.2523982Z scale_ub: Optional[float], 2025-05-07T20:33:31.2524071Z contiguous: bool, 2025-05-07T20:33:31.2524158Z compiled: bool, 2025-05-07T20:33:31.2524236Z ) -> None: 2025-05-07T20:33:31.2524331Z torch.manual_seed(2025) 2025-05-07T20:33:31.2524405Z 2025-05-07T20:33:31.2524576Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2524652Z 2025-05-07T20:33:31.2524746Z x_sign = torch.sign(x) 2025-05-07T20:33:31.2524869Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.2524964Z x = x_sign * x_clamp 2025-05-07T20:33:31.2525044Z x0 = x[:, :D] 2025-05-07T20:33:31.2525124Z x1 = x[:, D:] 2025-05-07T20:33:31.2525198Z 2025-05-07T20:33:31.2525283Z if contiguous: 2025-05-07T20:33:31.2525374Z x0 = x0.contiguous() 2025-05-07T20:33:31.2525733Z x1 = x1.contiguous() 2025-05-07T20:33:31.2525847Z 2025-05-07T20:33:31.2525960Z if scale_ub is not None: 2025-05-07T20:33:31.2526077Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:31.2526211Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:31.2526288Z ) 2025-05-07T20:33:31.2526368Z else: 2025-05-07T20:33:31.2526465Z scale_ub_tensor = None 2025-05-07T20:33:31.2526542Z 2025-05-07T20:33:31.2526678Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.2526768Z op = silu_mul_quant 2025-05-07T20:33:31.2526857Z if compiled: 2025-05-07T20:33:31.2526956Z op = torch.compile(op) 2025-05-07T20:33:31.2527061Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2527141Z 2025-05-07T20:33:31.2527232Z > y_fp8, y_scale = fn() 2025-05-07T20:33:31.2527237Z 2025-05-07T20:33:31.2527335Z moe/activation_test.py:117: 2025-05-07T20:33:31.2527471Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2527574Z moe/activation_test.py:115: in fn 2025-05-07T20:33:31.2527674Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2528210Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:31.2528310Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:31.2528784Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:31.2529015Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:31.2529434Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:31.2529537Z kernel = self.compile( 2025-05-07T20:33:31.2529945Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:31.2530128Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:31.2530331Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2530336Z 2025-05-07T20:33:31.2530546Z self = 2025-05-07T20:33:31.2531365Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:31.2531890Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e46aa160>} 2025-05-07T20:33:31.2532747Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:31.2532947Z context = 2025-05-07T20:33:31.2532952Z 2025-05-07T20:33:31.2533125Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:31.2533402Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:31.2533511Z module_map=module_map) 2025-05-07T20:33:31.2533685Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:31.2533788Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:31.2533868Z E ^ 2025-05-07T20:33:31.2534249Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:31.2534254Z 2025-05-07T20:33:31.2534761Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:31.2534766Z 2025-05-07T20:33:31.2534880Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2535115Z self=, 2025-05-07T20:33:31.2535195Z T=4096, 2025-05-07T20:33:31.2535278Z D=5120, 2025-05-07T20:33:31.2535360Z scale_ub=1200.0, 2025-05-07T20:33:31.2535446Z contiguous=False, 2025-05-07T20:33:31.2535534Z compiled=True, 2025-05-07T20:33:31.2535606Z ) 2025-05-07T20:33:31.2535831Z self = 2025-05-07T20:33:31.2536015Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:31.2536019Z 2025-05-07T20:33:31.2536096Z @given( 2025-05-07T20:33:31.2536221Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2536325Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2536439Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2536561Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2536673Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2536749Z ) 2025-05-07T20:33:31.2537006Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2537099Z def test_silu_mul_quant( 2025-05-07T20:33:31.2537174Z self, 2025-05-07T20:33:31.2537255Z T: int, 2025-05-07T20:33:31.2537331Z D: int, 2025-05-07T20:33:31.2537430Z scale_ub: Optional[float], 2025-05-07T20:33:31.2537574Z contiguous: bool, 2025-05-07T20:33:31.2537661Z compiled: bool, 2025-05-07T20:33:31.2537743Z ) -> None: 2025-05-07T20:33:31.2537840Z torch.manual_seed(2025) 2025-05-07T20:33:31.2537912Z 2025-05-07T20:33:31.2538149Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2538225Z 2025-05-07T20:33:31.2538319Z x_sign = torch.sign(x) 2025-05-07T20:33:31.2538450Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.2538541Z x = x_sign * x_clamp 2025-05-07T20:33:31.2538659Z x0 = x[:, :D] 2025-05-07T20:33:31.2538744Z x1 = x[:, D:] 2025-05-07T20:33:31.2538816Z 2025-05-07T20:33:31.2538901Z if contiguous: 2025-05-07T20:33:31.2538996Z x0 = x0.contiguous() 2025-05-07T20:33:31.2539085Z x1 = x1.contiguous() 2025-05-07T20:33:31.2539163Z 2025-05-07T20:33:31.2539252Z if scale_ub is not None: 2025-05-07T20:33:31.2539363Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:31.2539500Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:31.2539577Z ) 2025-05-07T20:33:31.2539653Z else: 2025-05-07T20:33:31.2539749Z scale_ub_tensor = None 2025-05-07T20:33:31.2539823Z 2025-05-07T20:33:31.2539954Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.2540053Z op = silu_mul_quant 2025-05-07T20:33:31.2540179Z if compiled: 2025-05-07T20:33:31.2540280Z op = torch.compile(op) 2025-05-07T20:33:31.2540392Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2540464Z 2025-05-07T20:33:31.2540559Z > y_fp8, y_scale = fn() 2025-05-07T20:33:31.2540564Z 2025-05-07T20:33:31.2540661Z moe/activation_test.py:117: 2025-05-07T20:33:31.2540795Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2540902Z moe/activation_test.py:115: in fn 2025-05-07T20:33:31.2541005Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2541392Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:31.2541492Z return fn(*args, **kwargs) 
2025-05-07T20:33:31.2542018Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:31.2542122Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:31.2542502Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:31.2542737Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:31.2543097Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:31.2543192Z kernel = self.compile( 2025-05-07T20:33:31.2543594Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:31.2543779Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:31.2543912Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2543917Z 2025-05-07T20:33:31.2544132Z self = 2025-05-07T20:33:31.2544949Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:31.2545471Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e46ab240>} 2025-05-07T20:33:31.2546270Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:31.2546510Z context = 2025-05-07T20:33:31.2546515Z 2025-05-07T20:33:31.2546726Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:31.2546999Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:31.2547109Z module_map=module_map) 2025-05-07T20:33:31.2547276Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:31.2547413Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:31.2547493Z E ^ 2025-05-07T20:33:31.2547868Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:31.2547873Z 2025-05-07T20:33:31.2548310Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:31.2548316Z 2025-05-07T20:33:31.2548427Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2548656Z self=, 2025-05-07T20:33:31.2548739Z T=2048, 2025-05-07T20:33:31.2548815Z D=7168, 2025-05-07T20:33:31.2548900Z scale_ub=1200.0, 2025-05-07T20:33:31.2548992Z contiguous=False, 2025-05-07T20:33:31.2549076Z compiled=False, 2025-05-07T20:33:31.2549149Z ) 2025-05-07T20:33:31.2549417Z self = 2025-05-07T20:33:31.2549601Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:31.2549606Z 2025-05-07T20:33:31.2549682Z @given( 2025-05-07T20:33:31.2549807Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2549905Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2550025Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2550145Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2550260Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2550337Z ) 2025-05-07T20:33:31.2550588Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2550686Z def test_silu_mul_quant( 2025-05-07T20:33:31.2550765Z self, 2025-05-07T20:33:31.2550841Z T: int, 2025-05-07T20:33:31.2550918Z D: int, 2025-05-07T20:33:31.2551026Z scale_ub: Optional[float], 2025-05-07T20:33:31.2551118Z contiguous: bool, 2025-05-07T20:33:31.2551202Z compiled: bool, 2025-05-07T20:33:31.2551283Z ) -> None: 2025-05-07T20:33:31.2551380Z torch.manual_seed(2025) 2025-05-07T20:33:31.2551455Z 2025-05-07T20:33:31.2551627Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2551701Z 2025-05-07T20:33:31.2551796Z x_sign = torch.sign(x) 2025-05-07T20:33:31.2551922Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.2552012Z x = x_sign * x_clamp 2025-05-07T20:33:31.2552097Z x0 = x[:, :D] 2025-05-07T20:33:31.2552175Z x1 = x[:, D:] 2025-05-07T20:33:31.2552247Z 2025-05-07T20:33:31.2552332Z if contiguous: 2025-05-07T20:33:31.2552426Z x0 = x0.contiguous() 2025-05-07T20:33:31.2552518Z x1 = x1.contiguous() 2025-05-07T20:33:31.2552593Z 2025-05-07T20:33:31.2552689Z if scale_ub is not None: 2025-05-07T20:33:31.2552796Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:31.2552933Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:31.2553007Z ) 2025-05-07T20:33:31.2553089Z else: 2025-05-07T20:33:31.2553183Z scale_ub_tensor = None 2025-05-07T20:33:31.2553256Z 2025-05-07T20:33:31.2553390Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.2553480Z op = silu_mul_quant 2025-05-07T20:33:31.2553692Z if compiled: 2025-05-07T20:33:31.2553796Z op = torch.compile(op) 2025-05-07T20:33:31.2553900Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2553975Z 2025-05-07T20:33:31.2554070Z > y_fp8, y_scale = fn() 2025-05-07T20:33:31.2554075Z 2025-05-07T20:33:31.2554211Z moe/activation_test.py:117: 2025-05-07T20:33:31.2554350Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2554456Z moe/activation_test.py:115: in fn 2025-05-07T20:33:31.2554556Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2555127Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:31.2555229Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:31.2555612Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:31.2555852Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:31.2556212Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:31.2556313Z kernel = self.compile( 2025-05-07T20:33:31.2556723Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:31.2556941Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:31.2557081Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2557088Z 2025-05-07T20:33:31.2557301Z self = 2025-05-07T20:33:31.2558123Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:31.2558644Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e51a4220>} 2025-05-07T20:33:31.2559446Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:31.2559648Z context = 2025-05-07T20:33:31.2559657Z 2025-05-07T20:33:31.2559828Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:31.2560106Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:31.2560216Z module_map=module_map) 2025-05-07T20:33:31.2560383Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:31.2560492Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:31.2560571Z E ^ 2025-05-07T20:33:31.2560945Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:31.2560951Z 2025-05-07T20:33:31.2561395Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:31.2561399Z 2025-05-07T20:33:31.2561508Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2561745Z self=, 2025-05-07T20:33:31.2561828Z T=1, 2025-05-07T20:33:31.2561908Z D=7168, 2025-05-07T20:33:31.2561993Z scale_ub=None, 2025-05-07T20:33:31.2562077Z contiguous=True, 2025-05-07T20:33:31.2562159Z compiled=False, 2025-05-07T20:33:31.2562233Z ) 2025-05-07T20:33:31.2562455Z self = 2025-05-07T20:33:31.2562623Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:31.2562672Z 2025-05-07T20:33:31.2562749Z @given( 2025-05-07T20:33:31.2562868Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2562970Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2563126Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2563244Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2563362Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2563439Z ) 2025-05-07T20:33:31.2563694Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2563828Z def test_silu_mul_quant( 2025-05-07T20:33:31.2563903Z self, 2025-05-07T20:33:31.2563982Z T: int, 2025-05-07T20:33:31.2564060Z D: int, 2025-05-07T20:33:31.2564159Z scale_ub: Optional[float], 2025-05-07T20:33:31.2564250Z contiguous: bool, 2025-05-07T20:33:31.2564335Z compiled: bool, 2025-05-07T20:33:31.2564415Z ) -> None: 2025-05-07T20:33:31.2564512Z torch.manual_seed(2025) 2025-05-07T20:33:31.2564585Z 2025-05-07T20:33:31.2564757Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2564849Z 2025-05-07T20:33:31.2564955Z x_sign = torch.sign(x) 2025-05-07T20:33:31.2565107Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.2565199Z x = x_sign * x_clamp 2025-05-07T20:33:31.2565277Z x0 = x[:, :D] 2025-05-07T20:33:31.2565405Z x1 = x[:, D:] 2025-05-07T20:33:31.2565484Z 2025-05-07T20:33:31.2565569Z if contiguous: 2025-05-07T20:33:31.2565665Z x0 = x0.contiguous() 2025-05-07T20:33:31.2565755Z x1 = x1.contiguous() 2025-05-07T20:33:31.2565829Z 2025-05-07T20:33:31.2565926Z if scale_ub is not None: 2025-05-07T20:33:31.2566032Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:31.2566167Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:31.2566249Z ) 2025-05-07T20:33:31.2566325Z else: 2025-05-07T20:33:31.2566418Z scale_ub_tensor = None 2025-05-07T20:33:31.2566495Z 2025-05-07T20:33:31.2566625Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.2566722Z op = silu_mul_quant 2025-05-07T20:33:31.2566806Z if compiled: 2025-05-07T20:33:31.2566905Z op = torch.compile(op) 2025-05-07T20:33:31.2567015Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2567089Z 2025-05-07T20:33:31.2567184Z > y_fp8, y_scale = fn() 2025-05-07T20:33:31.2567189Z 2025-05-07T20:33:31.2567287Z moe/activation_test.py:117: 2025-05-07T20:33:31.2567420Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2567524Z moe/activation_test.py:115: in fn 2025-05-07T20:33:31.2567627Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2568153Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:31.2568260Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:31.2568640Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:31.2568871Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:31.2569237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:31.2569332Z kernel = self.compile( 2025-05-07T20:33:31.2569736Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:31.2569915Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:31.2570045Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2570050Z 2025-05-07T20:33:31.2570334Z self = 2025-05-07T20:33:31.2571187Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:31.2571711Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e51a5120>} 2025-05-07T20:33:31.2572548Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:31.2572744Z context = 2025-05-07T20:33:31.2572748Z 2025-05-07T20:33:31.2572923Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:31.2573199Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:31.2573315Z module_map=module_map) 2025-05-07T20:33:31.2573482Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:31.2573590Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:31.2573674Z E ^ 2025-05-07T20:33:31.2574092Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:31.2574098Z 2025-05-07T20:33:31.2574603Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:31.2574612Z 2025-05-07T20:33:31.2574719Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2574977Z self=, 2025-05-07T20:33:31.2575072Z T=16384, 2025-05-07T20:33:31.2575168Z D=7168, 2025-05-07T20:33:31.2575259Z scale_ub=1200.0, 2025-05-07T20:33:31.2575350Z contiguous=False, 2025-05-07T20:33:31.2575436Z compiled=True, 2025-05-07T20:33:31.2575515Z ) 2025-05-07T20:33:31.2575744Z self = 2025-05-07T20:33:31.2575932Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:31.2575937Z 2025-05-07T20:33:31.2576021Z @given( 2025-05-07T20:33:31.2576141Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2576239Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2576360Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2576477Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2576589Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2576671Z ) 2025-05-07T20:33:31.2576923Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2577017Z def test_silu_mul_quant( 2025-05-07T20:33:31.2577097Z self, 2025-05-07T20:33:31.2577176Z T: int, 2025-05-07T20:33:31.2577255Z D: int, 2025-05-07T20:33:31.2577359Z scale_ub: Optional[float], 2025-05-07T20:33:31.2577451Z contiguous: bool, 2025-05-07T20:33:31.2577545Z compiled: bool, 2025-05-07T20:33:31.2577623Z ) -> None: 2025-05-07T20:33:31.2577719Z torch.manual_seed(2025) 2025-05-07T20:33:31.2577795Z 2025-05-07T20:33:31.2577969Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2578046Z 2025-05-07T20:33:31.2578147Z x_sign = torch.sign(x) 2025-05-07T20:33:31.2578271Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.2578362Z x = x_sign * x_clamp 2025-05-07T20:33:31.2578442Z x0 = x[:, :D] 2025-05-07T20:33:31.2578522Z x1 = x[:, D:] 2025-05-07T20:33:31.2578594Z 2025-05-07T20:33:31.2578683Z if contiguous: 2025-05-07T20:33:31.2578826Z x0 = x0.contiguous() 2025-05-07T20:33:31.2578919Z x1 = x1.contiguous() 2025-05-07T20:33:31.2578992Z 2025-05-07T20:33:31.2579082Z if scale_ub is not None: 2025-05-07T20:33:31.2579190Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:31.2579364Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:31.2579440Z ) 2025-05-07T20:33:31.2579516Z else: 2025-05-07T20:33:31.2579616Z scale_ub_tensor = None 2025-05-07T20:33:31.2579690Z 2025-05-07T20:33:31.2579829Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.2579964Z op = silu_mul_quant 2025-05-07T20:33:31.2580056Z if compiled: 2025-05-07T20:33:31.2580157Z op = torch.compile(op) 2025-05-07T20:33:31.2580267Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2580342Z 2025-05-07T20:33:31.2580434Z > y_fp8, y_scale = fn() 2025-05-07T20:33:31.2580440Z 2025-05-07T20:33:31.2580540Z moe/activation_test.py:117: 2025-05-07T20:33:31.2580682Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2580786Z moe/activation_test.py:115: in fn 2025-05-07T20:33:31.2580890Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2581335Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:31.2581430Z return fn(*args, **kwargs) 
2025-05-07T20:33:31.2582075Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:31.2582179Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:31.2582604Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:31.2582862Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:31.2583268Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:31.2583366Z kernel = self.compile( 2025-05-07T20:33:31.2583823Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:31.2584019Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:31.2584159Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2584166Z 2025-05-07T20:33:31.2584396Z self = 2025-05-07T20:33:31.2585366Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:31.2585982Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e51a6520>} 2025-05-07T20:33:31.2586906Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:31.2587128Z context = 2025-05-07T20:33:31.2587133Z 2025-05-07T20:33:31.2587315Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:31.2587625Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:31.2587737Z module_map=module_map) 2025-05-07T20:33:31.2587912Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:31.2588016Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:31.2588092Z E ^ 2025-05-07T20:33:31.2588515Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:31.2588600Z 2025-05-07T20:33:31.2589040Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:31.2589045Z 2025-05-07T20:33:31.2589193Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2589430Z self=, 2025-05-07T20:33:31.2589512Z T=1, 2025-05-07T20:33:31.2589596Z D=7168, 2025-05-07T20:33:31.2589684Z scale_ub=None, 2025-05-07T20:33:31.2590371Z contiguous=False, 2025-05-07T20:33:31.2590459Z compiled=False, 2025-05-07T20:33:31.2590541Z ) 2025-05-07T20:33:31.2590770Z self = 2025-05-07T20:33:31.2590943Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:31.2590948Z 2025-05-07T20:33:31.2591032Z @given( 2025-05-07T20:33:31.2591155Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2591259Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2591377Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2591494Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2591619Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2591699Z ) 2025-05-07T20:33:31.2591952Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2592097Z def test_silu_mul_quant( 2025-05-07T20:33:31.2592177Z self, 2025-05-07T20:33:31.2592257Z T: int, 2025-05-07T20:33:31.2592331Z D: int, 2025-05-07T20:33:31.2592428Z scale_ub: Optional[float], 2025-05-07T20:33:31.2592520Z contiguous: bool, 2025-05-07T20:33:31.2592604Z compiled: bool, 2025-05-07T20:33:31.2592680Z ) -> None: 2025-05-07T20:33:31.2592780Z torch.manual_seed(2025) 2025-05-07T20:33:31.2592856Z 2025-05-07T20:33:31.2593028Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2593106Z 2025-05-07T20:33:31.2593199Z x_sign = torch.sign(x) 2025-05-07T20:33:31.2593322Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.2593417Z x = x_sign * x_clamp 2025-05-07T20:33:31.2593498Z x0 = x[:, :D] 2025-05-07T20:33:31.2593581Z x1 = x[:, D:] 2025-05-07T20:33:31.2593653Z 2025-05-07T20:33:31.2593739Z if contiguous: 2025-05-07T20:33:31.2593834Z x0 = x0.contiguous() 2025-05-07T20:33:31.2593926Z x1 = x1.contiguous() 2025-05-07T20:33:31.2593999Z 2025-05-07T20:33:31.2594092Z if scale_ub is not None: 2025-05-07T20:33:31.2594197Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:31.2594329Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:31.2594410Z ) 2025-05-07T20:33:31.2594485Z else: 2025-05-07T20:33:31.2594580Z scale_ub_tensor = None 2025-05-07T20:33:31.2594655Z 2025-05-07T20:33:31.2594786Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.2594878Z op = silu_mul_quant 2025-05-07T20:33:31.2594966Z if compiled: 2025-05-07T20:33:31.2595067Z op = torch.compile(op) 2025-05-07T20:33:31.2595174Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2595246Z 2025-05-07T20:33:31.2595335Z > y_fp8, y_scale = fn() 2025-05-07T20:33:31.2595342Z 2025-05-07T20:33:31.2595441Z moe/activation_test.py:117: 2025-05-07T20:33:31.2595575Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2595673Z moe/activation_test.py:115: in fn 2025-05-07T20:33:31.2595775Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2596300Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:31.2596452Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:31.2596830Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:31.2597056Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:31.2597459Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:31.2597555Z kernel = self.compile( 2025-05-07T20:33:31.2597960Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:31.2598178Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:31.2598308Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2598313Z 2025-05-07T20:33:31.2598524Z self = 2025-05-07T20:33:31.2599337Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:31.2599858Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e51a7100>} 2025-05-07T20:33:31.2600714Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:31.2600911Z context = 2025-05-07T20:33:31.2600915Z 2025-05-07T20:33:31.2601089Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:31.2601358Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:31.2601471Z module_map=module_map) 2025-05-07T20:33:31.2601633Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:31.2601732Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:31.2601811Z E ^ 2025-05-07T20:33:31.2602185Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:31.2602190Z 2025-05-07T20:33:31.2602628Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:31.2602635Z 2025-05-07T20:33:31.2602742Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2602969Z self=, 2025-05-07T20:33:31.2603049Z T=2048, 2025-05-07T20:33:31.2603126Z D=7168, 2025-05-07T20:33:31.2603210Z scale_ub=None, 2025-05-07T20:33:31.2603298Z contiguous=False, 2025-05-07T20:33:31.2603383Z compiled=True, 2025-05-07T20:33:31.2603456Z ) 2025-05-07T20:33:31.2603682Z self = 2025-05-07T20:33:31.2603858Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:31.2603863Z 2025-05-07T20:33:31.2603941Z @given( 2025-05-07T20:33:31.2604067Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2604164Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2604283Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2604396Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2604507Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2604582Z ) 2025-05-07T20:33:31.2604831Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2604924Z def test_silu_mul_quant( 2025-05-07T20:33:31.2604999Z self, 2025-05-07T20:33:31.2605074Z T: int, 2025-05-07T20:33:31.2605198Z D: int, 2025-05-07T20:33:31.2605297Z scale_ub: Optional[float], 2025-05-07T20:33:31.2605383Z contiguous: bool, 2025-05-07T20:33:31.2605466Z compiled: bool, 2025-05-07T20:33:31.2605545Z ) -> None: 2025-05-07T20:33:31.2605636Z torch.manual_seed(2025) 2025-05-07T20:33:31.2605750Z 2025-05-07T20:33:31.2605921Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2605994Z 2025-05-07T20:33:31.2606092Z x_sign = torch.sign(x) 2025-05-07T20:33:31.2606212Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.2606335Z x = x_sign * x_clamp 2025-05-07T20:33:31.2606417Z x0 = x[:, :D] 2025-05-07T20:33:31.2606495Z x1 = x[:, D:] 2025-05-07T20:33:31.2606564Z 2025-05-07T20:33:31.2606651Z if contiguous: 2025-05-07T20:33:31.2606739Z x0 = x0.contiguous() 2025-05-07T20:33:31.2606826Z x1 = x1.contiguous() 2025-05-07T20:33:31.2606899Z 2025-05-07T20:33:31.2606985Z if scale_ub is not None: 2025-05-07T20:33:31.2607090Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:31.2607219Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:31.2607295Z ) 2025-05-07T20:33:31.2607372Z else: 2025-05-07T20:33:31.2607466Z scale_ub_tensor = None 2025-05-07T20:33:31.2607534Z 2025-05-07T20:33:31.2607663Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.2607793Z op = silu_mul_quant 2025-05-07T20:33:31.2607877Z if compiled: 2025-05-07T20:33:31.2607979Z op = torch.compile(op) 2025-05-07T20:33:31.2608082Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2608153Z 2025-05-07T20:33:31.2608243Z > y_fp8, y_scale = fn() 2025-05-07T20:33:31.2608248Z 2025-05-07T20:33:31.2608339Z moe/activation_test.py:117: 2025-05-07T20:33:31.2608471Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2608573Z moe/activation_test.py:115: in fn 2025-05-07T20:33:31.2608671Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2609057Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:31.2609150Z return fn(*args, **kwargs) 
2025-05-07T20:33:31.2609673Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:31.2609770Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:31.2610147Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:31.2610376Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:31.2610729Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:31.2610820Z kernel = self.compile( 2025-05-07T20:33:31.2611227Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:31.2611401Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:31.2611535Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2611542Z 2025-05-07T20:33:31.2611747Z self = 2025-05-07T20:33:31.2612560Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:31.2613076Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e4704720>} 2025-05-07T20:33:31.2613865Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:31.2614107Z context = 2025-05-07T20:33:31.2614112Z 2025-05-07T20:33:31.2614316Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:31.2614648Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:31.2614759Z module_map=module_map) 2025-05-07T20:33:31.2614964Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:31.2615066Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:31.2615140Z E ^ 2025-05-07T20:33:31.2615510Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:31.2615515Z 2025-05-07T20:33:31.2615956Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:31.2615960Z 2025-05-07T20:33:31.2616062Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2616290Z self=, 2025-05-07T20:33:31.2616373Z T=4096, 2025-05-07T20:33:31.2616451Z D=7168, 2025-05-07T20:33:31.2616533Z scale_ub=None, 2025-05-07T20:33:31.2616617Z contiguous=False, 2025-05-07T20:33:31.2616737Z compiled=True, 2025-05-07T20:33:31.2616813Z ) 2025-05-07T20:33:31.2617043Z self = 2025-05-07T20:33:31.2617222Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:31.2617226Z 2025-05-07T20:33:31.2617308Z @given( 2025-05-07T20:33:31.2617427Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2617527Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2617649Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2617767Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2617883Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2617955Z ) 2025-05-07T20:33:31.2618212Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2618310Z def test_silu_mul_quant( 2025-05-07T20:33:31.2618385Z self, 2025-05-07T20:33:31.2618466Z T: int, 2025-05-07T20:33:31.2618554Z D: int, 2025-05-07T20:33:31.2618658Z scale_ub: Optional[float], 2025-05-07T20:33:31.2618747Z contiguous: bool, 2025-05-07T20:33:31.2618836Z compiled: bool, 2025-05-07T20:33:31.2618909Z ) -> None: 2025-05-07T20:33:31.2619000Z torch.manual_seed(2025) 2025-05-07T20:33:31.2619074Z 2025-05-07T20:33:31.2619246Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2619324Z 2025-05-07T20:33:31.2619412Z x_sign = torch.sign(x) 2025-05-07T20:33:31.2619534Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.2619622Z x = x_sign * x_clamp 2025-05-07T20:33:31.2619697Z x0 = x[:, :D] 2025-05-07T20:33:31.2619774Z x1 = x[:, D:] 2025-05-07T20:33:31.2619850Z 2025-05-07T20:33:31.2619929Z if contiguous: 2025-05-07T20:33:31.2620017Z x0 = x0.contiguous() 2025-05-07T20:33:31.2620106Z x1 = x1.contiguous() 2025-05-07T20:33:31.2620179Z 2025-05-07T20:33:31.2620268Z if scale_ub is not None: 2025-05-07T20:33:31.2620376Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:31.2620508Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:31.2620588Z ) 2025-05-07T20:33:31.2620665Z else: 2025-05-07T20:33:31.2620754Z scale_ub_tensor = None 2025-05-07T20:33:31.2620828Z 2025-05-07T20:33:31.2620956Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.2621093Z op = silu_mul_quant 2025-05-07T20:33:31.2621178Z if compiled: 2025-05-07T20:33:31.2621276Z op = torch.compile(op) 2025-05-07T20:33:31.2621382Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2621454Z 2025-05-07T20:33:31.2621582Z > y_fp8, y_scale = fn() 2025-05-07T20:33:31.2621587Z 2025-05-07T20:33:31.2621682Z moe/activation_test.py:117: 2025-05-07T20:33:31.2621819Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2621955Z moe/activation_test.py:115: in fn 2025-05-07T20:33:31.2622055Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2622440Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:31.2622530Z return fn(*args, **kwargs) 
2025-05-07T20:33:31.2623056Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:31.2623157Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:31.2623531Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:31.2623765Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:31.2624123Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:31.2624252Z kernel = self.compile( 2025-05-07T20:33:31.2624659Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:31.2624841Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:31.2624976Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2624981Z 2025-05-07T20:33:31.2625190Z self = 2025-05-07T20:33:31.2626292Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:31.2626816Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e4705440>} 2025-05-07T20:33:31.2627612Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:31.2627817Z context = 2025-05-07T20:33:31.2627822Z 2025-05-07T20:33:31.2627991Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:31.2628266Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:31.2628376Z module_map=module_map) 2025-05-07T20:33:31.2628539Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:31.2628641Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:31.2628721Z E ^ 2025-05-07T20:33:31.2629098Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:31.2629103Z 2025-05-07T20:33:31.2629541Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:31.2629547Z 2025-05-07T20:33:31.2629650Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2629887Z self=, 2025-05-07T20:33:31.2629965Z T=16384, 2025-05-07T20:33:31.2630045Z D=5120, 2025-05-07T20:33:31.2630132Z scale_ub=1200.0, 2025-05-07T20:33:31.2630311Z contiguous=False, 2025-05-07T20:33:31.2630397Z compiled=False, 2025-05-07T20:33:31.2630472Z ) 2025-05-07T20:33:31.2630696Z self = 2025-05-07T20:33:31.2630885Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:31.2630949Z 2025-05-07T20:33:31.2631028Z @given( 2025-05-07T20:33:31.2631146Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2631254Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2631368Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2631571Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2631687Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2631762Z ) 2025-05-07T20:33:31.2632020Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2632112Z def test_silu_mul_quant( 2025-05-07T20:33:31.2632191Z self, 2025-05-07T20:33:31.2632272Z T: int, 2025-05-07T20:33:31.2632349Z D: int, 2025-05-07T20:33:31.2632445Z scale_ub: Optional[float], 2025-05-07T20:33:31.2632538Z contiguous: bool, 2025-05-07T20:33:31.2632625Z compiled: bool, 2025-05-07T20:33:31.2632703Z ) -> None: 2025-05-07T20:33:31.2632805Z torch.manual_seed(2025) 2025-05-07T20:33:31.2632877Z 2025-05-07T20:33:31.2633045Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2633179Z 2025-05-07T20:33:31.2633271Z x_sign = torch.sign(x) 2025-05-07T20:33:31.2633397Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.2633483Z x = x_sign * x_clamp 2025-05-07T20:33:31.2633558Z x0 = x[:, :D] 2025-05-07T20:33:31.2633638Z x1 = x[:, D:] 2025-05-07T20:33:31.2633711Z 2025-05-07T20:33:31.2633791Z if contiguous: 2025-05-07T20:33:31.2633886Z x0 = x0.contiguous() 2025-05-07T20:33:31.2633975Z x1 = x1.contiguous() 2025-05-07T20:33:31.2634044Z 2025-05-07T20:33:31.2634135Z if scale_ub is not None: 2025-05-07T20:33:31.2634235Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:31.2634366Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:31.2634446Z ) 2025-05-07T20:33:31.2634522Z else: 2025-05-07T20:33:31.2634614Z scale_ub_tensor = None 2025-05-07T20:33:31.2634685Z 2025-05-07T20:33:31.2634815Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.2634907Z op = silu_mul_quant 2025-05-07T20:33:31.2634987Z if compiled: 2025-05-07T20:33:31.2635084Z op = torch.compile(op) 2025-05-07T20:33:31.2635193Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2635263Z 2025-05-07T20:33:31.2635354Z > y_fp8, y_scale = fn() 2025-05-07T20:33:31.2635359Z 2025-05-07T20:33:31.2635455Z moe/activation_test.py:117: 2025-05-07T20:33:31.2635587Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2635686Z moe/activation_test.py:115: in fn 2025-05-07T20:33:31.2635784Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2636308Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
_fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
self =
T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': }
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
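Note on the root cause: fp8e4nv is Triton's name for the e4m3 float8 format that torch.float8_e4m3fn lowers to, and Triton's NVIDIA backend only compiles it for GPUs of compute capability 8.9 or newer; older architectures get only fp8e4b15 and fp8e5, which is exactly what the ValueError reports. A minimal capability probe is sketched below under that assumption; the helper names are illustrative, not fbgemm_gpu API.

# Sketch only: probe for fp8 e4m3 support before dispatching to a Triton
# kernel that casts to fp8e4nv. Helper names here are hypothetical.
import torch

def fp8_e4m3_supported() -> bool:
    # Assumption: Triton's NVIDIA backend lowers fp8e4nv only on
    # compute capability >= 8.9 (Ada/Hopper and newer).
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

def pick_fp8_dtype() -> torch.dtype:
    # Illustrative fallback: e5m2 (Triton's fp8e5) compiles on older
    # parts, trading one mantissa bit for exponent range.
    return torch.float8_e4m3fn if fp8_e4m3_supported() else torch.float8_e5m2

With a probe like this, a wrapper around silu_mul_quant could select a compilable dtype, or raise a clear error, before the failure surfaces inside Triton's make_ir.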
Hypothesis continued with the remaining examples; every one failed identically, with the same source listing, the same call chain (moe/activation_test.py:117 -> moe/activation_test.py:115 -> fbgemm_gpu/experimental/gen_ai/moe/activation.py:80 -> triton/runtime/jit.py -> triton/compiler/compiler.py:100), and the same CompilationError. The only variation: for examples with compiled=False the torch/_dynamo/eval_frame.py:678 frame is absent, since torch.compile is not applied. The retried examples, in order:

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None,   contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=None,   contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=True,  compiled=True)
Trying example: test_silu_mul_quant(T=128,   D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True,  compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True,  compiled=False)
Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=128,   D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=None,   contiguous=True,  compiled=True)

Each ended with:

E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:31.2815707Z 2025-05-07T20:33:31.2816147Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:31.2816198Z 2025-05-07T20:33:31.2816310Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2816541Z self=, 2025-05-07T20:33:31.2816621Z T=16384, 2025-05-07T20:33:31.2816706Z D=5120, 2025-05-07T20:33:31.2816829Z scale_ub=None, 2025-05-07T20:33:31.2816918Z contiguous=False, 2025-05-07T20:33:31.2817008Z compiled=False, 2025-05-07T20:33:31.2817079Z ) 2025-05-07T20:33:31.2817308Z self = 2025-05-07T20:33:31.2817533Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:31.2817537Z 2025-05-07T20:33:31.2817619Z @given( 2025-05-07T20:33:31.2817745Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2817847Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2817964Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2818089Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2818206Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2818279Z ) 2025-05-07T20:33:31.2818536Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2818634Z def test_silu_mul_quant( 2025-05-07T20:33:31.2818715Z self, 2025-05-07T20:33:31.2818793Z T: int, 2025-05-07T20:33:31.2818870Z D: int, 2025-05-07T20:33:31.2819012Z scale_ub: Optional[float], 2025-05-07T20:33:31.2819107Z contiguous: bool, 2025-05-07T20:33:31.2819199Z compiled: bool, 2025-05-07T20:33:31.2819284Z ) -> None: 2025-05-07T20:33:31.2819377Z torch.manual_seed(2025) 2025-05-07T20:33:31.2819449Z 2025-05-07T20:33:31.2819626Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2819701Z 2025-05-07T20:33:31.2819796Z x_sign = torch.sign(x) 2025-05-07T20:33:31.2819923Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.2821879Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
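The CompilationError repeated above is Triton rejecting the fp8e4nv (e4m3fn) dtype: hardware e4m3 conversion support first ships with compute capability 8.9 (Ada), and on older parts Triton only exposes fp8e4b15 and fp8e5, exactly as the message says; the 22.07 GiB capacity in the OOM reports is consistent with an sm_86-class card. A minimal guard sketch, assuming a unittest-style suite and that skipping is acceptable (the helper and class names are illustrative, not from FBGEMM):

import unittest

import torch

def supports_fp8e4nv() -> bool:
    # fp8e4nv (e4m3fn) needs hardware support introduced with compute
    # capability 8.9; earlier GPUs only get fp8e4b15 / fp8e5 in Triton.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

@unittest.skipIf(not supports_fp8e4nv(), "GPU lacks fp8e4nv (e4m3) support")
class Fp8ActivationTests(unittest.TestCase):
    ...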
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:31.2821890Z 2025-05-07T20:33:31.2822008Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:31.2822013Z 2025-05-07T20:33:31.2822117Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2822352Z self=, 2025-05-07T20:33:31.2822433Z T=4096, 2025-05-07T20:33:31.2822514Z D=7168, 2025-05-07T20:33:31.2822605Z scale_ub=1200.0, 2025-05-07T20:33:31.2822697Z contiguous=True, 2025-05-07T20:33:31.2822786Z compiled=True, 2025-05-07T20:33:31.2822868Z ) 2025-05-07T20:33:31.2823092Z self = 2025-05-07T20:33:31.2823271Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:31.2823278Z 2025-05-07T20:33:31.2823360Z @given( 2025-05-07T20:33:31.2823484Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2823590Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2823709Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2823829Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2823950Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2824025Z ) 2025-05-07T20:33:31.2824281Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2824433Z def test_silu_mul_quant( 2025-05-07T20:33:31.2824513Z self, 2025-05-07T20:33:31.2824594Z T: int, 2025-05-07T20:33:31.2824674Z D: int, 2025-05-07T20:33:31.2824775Z scale_ub: Optional[float], 2025-05-07T20:33:31.2824870Z contiguous: bool, 2025-05-07T20:33:31.2824997Z compiled: bool, 2025-05-07T20:33:31.2825084Z ) -> None: 2025-05-07T20:33:31.2825189Z torch.manual_seed(2025) 2025-05-07T20:33:31.2825262Z 2025-05-07T20:33:31.2825672Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2825980Z 2025-05-07T20:33:31.2826077Z x_sign = torch.sign(x) 2025-05-07T20:33:31.2826206Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.2828146Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:31.2828155Z 2025-05-07T20:33:31.2828276Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:31.2828280Z 2025-05-07T20:33:31.2828453Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2828684Z self=, 2025-05-07T20:33:31.2828774Z T=16384, 2025-05-07T20:33:31.2828857Z D=7168, 2025-05-07T20:33:31.2828945Z scale_ub=None, 2025-05-07T20:33:31.2829036Z contiguous=False, 2025-05-07T20:33:31.2829126Z compiled=False, 2025-05-07T20:33:31.2829201Z ) 2025-05-07T20:33:31.2829433Z self = 2025-05-07T20:33:31.2829617Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:31.2829622Z 2025-05-07T20:33:31.2829703Z @given( 2025-05-07T20:33:31.2829833Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2829937Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2830057Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2830181Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2830301Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2830385Z ) 2025-05-07T20:33:31.2830641Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2830735Z def test_silu_mul_quant( 2025-05-07T20:33:31.2830816Z self, 2025-05-07T20:33:31.2830894Z T: int, 2025-05-07T20:33:31.2830973Z D: int, 2025-05-07T20:33:31.2831077Z scale_ub: Optional[float], 2025-05-07T20:33:31.2831167Z contiguous: bool, 2025-05-07T20:33:31.2831259Z compiled: bool, 2025-05-07T20:33:31.2831344Z ) -> None: 2025-05-07T20:33:31.2831437Z torch.manual_seed(2025) 2025-05-07T20:33:31.2831512Z 2025-05-07T20:33:31.2831680Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2833610Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:31.2833625Z 2025-05-07T20:33:31.2833740Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:31.2833814Z 2025-05-07T20:33:31.2833915Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2834145Z self=, 2025-05-07T20:33:31.2834222Z T=2048, 2025-05-07T20:33:31.2834300Z D=7168, 2025-05-07T20:33:31.2834383Z scale_ub=1200.0, 2025-05-07T20:33:31.2834524Z contiguous=True, 2025-05-07T20:33:31.2834611Z compiled=True, 2025-05-07T20:33:31.2834688Z ) 2025-05-07T20:33:31.2834909Z self = 2025-05-07T20:33:31.2835130Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:31.2835135Z 2025-05-07T20:33:31.2835208Z @given( 2025-05-07T20:33:31.2835322Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2835424Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2835535Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2835648Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2835764Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2835837Z ) 2025-05-07T20:33:31.2836086Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2836183Z def test_silu_mul_quant( 2025-05-07T20:33:31.2836263Z self, 2025-05-07T20:33:31.2836340Z T: int, 2025-05-07T20:33:31.2836417Z D: int, 2025-05-07T20:33:31.2836512Z scale_ub: Optional[float], 2025-05-07T20:33:31.2836641Z contiguous: bool, 2025-05-07T20:33:31.2836729Z compiled: bool, 2025-05-07T20:33:31.2836804Z ) -> None: 2025-05-07T20:33:31.2836898Z torch.manual_seed(2025) 2025-05-07T20:33:31.2836969Z 2025-05-07T20:33:31.2837135Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2837212Z 2025-05-07T20:33:31.2837304Z x_sign = torch.sign(x) 2025-05-07T20:33:31.2837424Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.2839342Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:31.2839351Z 2025-05-07T20:33:31.2839470Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:31.2839474Z 2025-05-07T20:33:31.2839575Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2839803Z self=, 2025-05-07T20:33:31.2839884Z T=2048, 2025-05-07T20:33:31.2839960Z D=7168, 2025-05-07T20:33:31.2840044Z scale_ub=None, 2025-05-07T20:33:31.2840133Z contiguous=True, 2025-05-07T20:33:31.2840216Z compiled=False, 2025-05-07T20:33:31.2840293Z ) 2025-05-07T20:33:31.2840520Z self = 2025-05-07T20:33:31.2840696Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:31.2840700Z 2025-05-07T20:33:31.2840776Z @given( 2025-05-07T20:33:31.2840893Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2840990Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2841109Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2841225Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2841337Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2841410Z ) 2025-05-07T20:33:31.2841665Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2841756Z def test_silu_mul_quant( 2025-05-07T20:33:31.2841876Z self, 2025-05-07T20:33:31.2841955Z T: int, 2025-05-07T20:33:31.2842036Z D: int, 2025-05-07T20:33:31.2842132Z scale_ub: Optional[float], 2025-05-07T20:33:31.2842219Z contiguous: bool, 2025-05-07T20:33:31.2842306Z compiled: bool, 2025-05-07T20:33:31.2842448Z ) -> None: 2025-05-07T20:33:31.2842540Z torch.manual_seed(2025) 2025-05-07T20:33:31.2842616Z 2025-05-07T20:33:31.2842786Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2842859Z 2025-05-07T20:33:31.2843001Z > x_sign = torch.sign(x) 2025-05-07T20:33:31.2844913Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
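Across the OutOfMemoryError reports above, memory "allocated by PyTorch" hovers between 21.50 and 21.73 GiB while each example only needs tens to hundreds of MiB, so tensors from earlier failed examples are evidently still held when the next draw runs. Two mitigations follow directly from the message: the allocator option it suggests, plus releasing cached blocks between examples. A sketch, assuming it runs before CUDA is first initialized (e.g. at the top of a conftest.py); the helper name is illustrative:

import gc
import os

# Must be set before the process first touches CUDA for the allocator to see it.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch

def release_cuda_memory() -> None:
    # Drop dead Python references, then return cached blocks to the driver
    # so the next Hypothesis example starts from a clean pool.
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
        torch.cuda.empty_cache()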
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:31.2844922Z 2025-05-07T20:33:31.2845040Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:31.2845045Z 2025-05-07T20:33:31.2845145Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2845415Z self=, 2025-05-07T20:33:31.2845489Z T=1, 2025-05-07T20:33:31.2845564Z D=7168, 2025-05-07T20:33:31.2845648Z scale_ub=1200.0, 2025-05-07T20:33:31.2845732Z contiguous=True, 2025-05-07T20:33:31.2845815Z compiled=False, 2025-05-07T20:33:31.2845892Z ) 2025-05-07T20:33:31.2846111Z self = 2025-05-07T20:33:31.2846276Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:31.2846283Z 2025-05-07T20:33:31.2846364Z @given( 2025-05-07T20:33:31.2846482Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2846580Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2846691Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2846807Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2846921Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2846992Z ) 2025-05-07T20:33:31.2847243Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2847341Z def test_silu_mul_quant( 2025-05-07T20:33:31.2847419Z self, 2025-05-07T20:33:31.2847494Z T: int, 2025-05-07T20:33:31.2847569Z D: int, 2025-05-07T20:33:31.2847666Z scale_ub: Optional[float], 2025-05-07T20:33:31.2847756Z contiguous: bool, 2025-05-07T20:33:31.2847837Z compiled: bool, 2025-05-07T20:33:31.2847914Z ) -> None: 2025-05-07T20:33:31.2848013Z torch.manual_seed(2025) 2025-05-07T20:33:31.2848087Z 2025-05-07T20:33:31.2848255Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2848333Z 2025-05-07T20:33:31.2848424Z x_sign = torch.sign(x) 2025-05-07T20:33:31.2848549Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.2848640Z x = x_sign * x_clamp 2025-05-07T20:33:31.2848715Z x0 = x[:, :D] 2025-05-07T20:33:31.2848792Z x1 = x[:, D:] 2025-05-07T20:33:31.2848872Z 2025-05-07T20:33:31.2848957Z if contiguous: 2025-05-07T20:33:31.2849049Z x0 = x0.contiguous() 2025-05-07T20:33:31.2849139Z x1 = x1.contiguous() 2025-05-07T20:33:31.2849209Z 2025-05-07T20:33:31.2849300Z if scale_ub is not None: 2025-05-07T20:33:31.2849403Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:31.2849535Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:31.2849665Z ) 2025-05-07T20:33:31.2849737Z else: 2025-05-07T20:33:31.2849831Z scale_ub_tensor = None 2025-05-07T20:33:31.2849904Z 2025-05-07T20:33:31.2850033Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.2850121Z op = silu_mul_quant 2025-05-07T20:33:31.2850250Z if compiled: 2025-05-07T20:33:31.2850350Z op = torch.compile(op) 2025-05-07T20:33:31.2850452Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2850532Z 2025-05-07T20:33:31.2850623Z > y_fp8, y_scale = fn() 2025-05-07T20:33:31.2850666Z 2025-05-07T20:33:31.2850771Z moe/activation_test.py:117: 2025-05-07T20:33:31.2850906Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2851011Z moe/activation_test.py:115: in fn 2025-05-07T20:33:31.2851116Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2851641Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:31.2851739Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:31.2852123Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:31.2852353Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:31.2852714Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:31.2852847Z kernel = self.compile( 2025-05-07T20:33:31.2853255Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:31.2853437Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:31.2853567Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2853571Z 2025-05-07T20:33:31.2853789Z self = 2025-05-07T20:33:31.2854670Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:31.2855191Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e4296520>} 2025-05-07T20:33:31.2855994Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:31.2856192Z context = 2025-05-07T20:33:31.2856197Z 2025-05-07T20:33:31.2856371Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:31.2856647Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:31.2856757Z module_map=module_map) 2025-05-07T20:33:31.2856927Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:31.2857030Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:31.2857112Z E ^ 2025-05-07T20:33:31.2857485Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:31.2857492Z 2025-05-07T20:33:31.2857935Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:31.2857942Z 2025-05-07T20:33:31.2858049Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2858283Z self=, 2025-05-07T20:33:31.2858363Z T=128, 2025-05-07T20:33:31.2858445Z D=5120, 2025-05-07T20:33:31.2858535Z scale_ub=None, 2025-05-07T20:33:31.2858674Z contiguous=True, 2025-05-07T20:33:31.2858759Z compiled=False, 2025-05-07T20:33:31.2858835Z ) 2025-05-07T20:33:31.2859065Z self = 2025-05-07T20:33:31.2859242Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:31.2859285Z 2025-05-07T20:33:31.2859363Z @given( 2025-05-07T20:33:31.2859487Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2859595Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2859710Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2859877Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2859994Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2860070Z ) 2025-05-07T20:33:31.2860323Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2860419Z def test_silu_mul_quant( 2025-05-07T20:33:31.2860508Z self, 2025-05-07T20:33:31.2860587Z T: int, 2025-05-07T20:33:31.2860666Z D: int, 2025-05-07T20:33:31.2860766Z scale_ub: Optional[float], 2025-05-07T20:33:31.2860856Z contiguous: bool, 2025-05-07T20:33:31.2860941Z compiled: bool, 2025-05-07T20:33:31.2861023Z ) -> None: 2025-05-07T20:33:31.2861123Z torch.manual_seed(2025) 2025-05-07T20:33:31.2861197Z 2025-05-07T20:33:31.2861372Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2861489Z 2025-05-07T20:33:31.2861588Z x_sign = torch.sign(x) 2025-05-07T20:33:31.2861718Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.2861806Z x = x_sign * x_clamp 2025-05-07T20:33:31.2861896Z x0 = x[:, :D] 2025-05-07T20:33:31.2861975Z x1 = x[:, D:] 2025-05-07T20:33:31.2862049Z 2025-05-07T20:33:31.2862136Z if contiguous: 2025-05-07T20:33:31.2862230Z x0 = x0.contiguous() 2025-05-07T20:33:31.2862327Z x1 = x1.contiguous() 2025-05-07T20:33:31.2862407Z 2025-05-07T20:33:31.2862500Z if scale_ub is not None: 2025-05-07T20:33:31.2862610Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:31.2862751Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:31.2862829Z ) 2025-05-07T20:33:31.2862910Z else: 2025-05-07T20:33:31.2863011Z scale_ub_tensor = None 2025-05-07T20:33:31.2863085Z 2025-05-07T20:33:31.2863226Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.2863320Z op = silu_mul_quant 2025-05-07T20:33:31.2863407Z if compiled: 2025-05-07T20:33:31.2863514Z op = torch.compile(op) 2025-05-07T20:33:31.2863619Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2863691Z 2025-05-07T20:33:31.2863788Z > y_fp8, y_scale = fn() 2025-05-07T20:33:31.2863792Z 2025-05-07T20:33:31.2863891Z moe/activation_test.py:117: 2025-05-07T20:33:31.2864029Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2864134Z moe/activation_test.py:115: in fn 2025-05-07T20:33:31.2864238Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2864771Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:31.2864875Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:31.2865258Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:31.2865498Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:31.2865857Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:31.2865956Z kernel = self.compile( 2025-05-07T20:33:31.2866362Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:31.2866587Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:31.2866723Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2866728Z 2025-05-07T20:33:31.2866978Z self = 2025-05-07T20:33:31.2867796Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:31.2868359Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e4297420>} 2025-05-07T20:33:31.2869154Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:31.2869361Z context = 2025-05-07T20:33:31.2869365Z 2025-05-07T20:33:31.2869536Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:31.2869815Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:31.2869922Z module_map=module_map) 2025-05-07T20:33:31.2870126Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:31.2870238Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:31.2870317Z E ^ 2025-05-07T20:33:31.2870689Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:31.2870694Z 2025-05-07T20:33:31.2871136Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:31.2871144Z 2025-05-07T20:33:31.2871250Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2871483Z self=, 2025-05-07T20:33:31.2871563Z T=128, 2025-05-07T20:33:31.2871641Z D=7168, 2025-05-07T20:33:31.2871728Z scale_ub=None, 2025-05-07T20:33:31.2871818Z contiguous=True, 2025-05-07T20:33:31.2871901Z compiled=False, 2025-05-07T20:33:31.2871984Z ) 2025-05-07T20:33:31.2872213Z self = 2025-05-07T20:33:31.2872392Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:31.2872401Z 2025-05-07T20:33:31.2872482Z @given( 2025-05-07T20:33:31.2872602Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2872708Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2872823Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2872944Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2873066Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2873141Z ) 2025-05-07T20:33:31.2873395Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2873494Z def test_silu_mul_quant( 2025-05-07T20:33:31.2873572Z self, 2025-05-07T20:33:31.2873654Z T: int, 2025-05-07T20:33:31.2873731Z D: int, 2025-05-07T20:33:31.2873833Z scale_ub: Optional[float], 2025-05-07T20:33:31.2873930Z contiguous: bool, 2025-05-07T20:33:31.2874021Z compiled: bool, 2025-05-07T20:33:31.2874099Z ) -> None: 2025-05-07T20:33:31.2874198Z torch.manual_seed(2025) 2025-05-07T20:33:31.2874277Z 2025-05-07T20:33:31.2874474Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2874566Z 2025-05-07T20:33:31.2874673Z x_sign = torch.sign(x) 2025-05-07T20:33:31.2874798Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.2874967Z x = x_sign * x_clamp 2025-05-07T20:33:31.2875050Z x0 = x[:, :D] 2025-05-07T20:33:31.2875131Z x1 = x[:, D:] 2025-05-07T20:33:31.2875209Z 2025-05-07T20:33:31.2875295Z if contiguous: 2025-05-07T20:33:31.2875393Z x0 = x0.contiguous() 2025-05-07T20:33:31.2875525Z x1 = x1.contiguous() 2025-05-07T20:33:31.2875599Z 2025-05-07T20:33:31.2875696Z if scale_ub is not None: 2025-05-07T20:33:31.2875806Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:31.2875944Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:31.2876066Z ) 2025-05-07T20:33:31.2876146Z else: 2025-05-07T20:33:31.2876241Z scale_ub_tensor = None 2025-05-07T20:33:31.2876318Z 2025-05-07T20:33:31.2876450Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.2876542Z op = silu_mul_quant 2025-05-07T20:33:31.2876633Z if compiled: 2025-05-07T20:33:31.2876737Z op = torch.compile(op) 2025-05-07T20:33:31.2876850Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2876923Z 2025-05-07T20:33:31.2877015Z > y_fp8, y_scale = fn() 2025-05-07T20:33:31.2877020Z 2025-05-07T20:33:31.2877121Z moe/activation_test.py:117: 2025-05-07T20:33:31.2877259Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2877361Z moe/activation_test.py:115: in fn 2025-05-07T20:33:31.2877509Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2878042Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:31.2878141Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:31.2878525Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:31.2878754Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:31.2879116Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:31.2879212Z kernel = self.compile( 2025-05-07T20:33:31.2879619Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:31.2879804Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:31.2879938Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2879945Z 2025-05-07T20:33:31.2880160Z self = 2025-05-07T20:33:31.2880973Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:31.2881493Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e44dc4a0>} 2025-05-07T20:33:31.2882293Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:31.2882490Z context = 2025-05-07T20:33:31.2882498Z 2025-05-07T20:33:31.2882674Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:31.2882949Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:31.2883057Z module_map=module_map) 2025-05-07T20:33:31.2883228Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:31.2883331Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:31.2883453Z E ^ 2025-05-07T20:33:31.2883826Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:31.2883831Z 2025-05-07T20:33:31.2884303Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:31.2884308Z 2025-05-07T20:33:31.2884435Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2884698Z self=, 2025-05-07T20:33:31.2884783Z T=2048, 2025-05-07T20:33:31.2884900Z D=7168, 2025-05-07T20:33:31.2884986Z scale_ub=1200.0, 2025-05-07T20:33:31.2885076Z contiguous=True, 2025-05-07T20:33:31.2885161Z compiled=False, 2025-05-07T20:33:31.2885238Z ) 2025-05-07T20:33:31.2885466Z self = 2025-05-07T20:33:31.2885643Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:31.2885650Z 2025-05-07T20:33:31.2885728Z @given( 2025-05-07T20:33:31.2885849Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2885949Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2886068Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2886192Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2886308Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2886384Z ) 2025-05-07T20:33:31.2886681Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2886783Z def test_silu_mul_quant( 2025-05-07T20:33:31.2886866Z self, 2025-05-07T20:33:31.2886942Z T: int, 2025-05-07T20:33:31.2887021Z D: int, 2025-05-07T20:33:31.2887130Z scale_ub: Optional[float], 2025-05-07T20:33:31.2887220Z contiguous: bool, 2025-05-07T20:33:31.2887307Z compiled: bool, 2025-05-07T20:33:31.2887396Z ) -> None: 2025-05-07T20:33:31.2887494Z torch.manual_seed(2025) 2025-05-07T20:33:31.2887573Z 2025-05-07T20:33:31.2887747Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2889675Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
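Each "Trying example: test_silu_mul_quant(...)" block above is Hypothesis verbose mode re-running the same test body with a fresh draw from the sampled_from strategies. When debugging locally, a single failing draw can be replayed deterministically by pinning it with hypothesis.example before the random search starts. A sketch mirroring the decorators above (strategy lists shortened for brevity; the body is elided):

from hypothesis import Verbosity, example, given, settings
from hypothesis import strategies as st

@given(
    T=st.sampled_from([1, 128, 2048]),
    D=st.sampled_from([5120, 7168]),
)
@example(T=128, D=7168)  # pin the first failing draw reported above
@settings(verbosity=Verbosity.verbose, deadline=None)
def test_silu_mul_quant_repro(T: int, D: int) -> None:
    ...  # same body as test_silu_mul_quant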
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:31.2889686Z 2025-05-07T20:33:31.2889805Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:31.2889810Z 2025-05-07T20:33:31.2889914Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2890151Z self=, 2025-05-07T20:33:31.2890232Z T=1, 2025-05-07T20:33:31.2890314Z D=5120, 2025-05-07T20:33:31.2890402Z scale_ub=1200.0, 2025-05-07T20:33:31.2890490Z contiguous=True, 2025-05-07T20:33:31.2890578Z compiled=False, 2025-05-07T20:33:31.2890659Z ) 2025-05-07T20:33:31.2890886Z self = 2025-05-07T20:33:31.2891060Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:31.2891064Z 2025-05-07T20:33:31.2891145Z @given( 2025-05-07T20:33:31.2891263Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2891368Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2891483Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2891599Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2891715Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2891838Z ) 2025-05-07T20:33:31.2892095Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2892195Z def test_silu_mul_quant( 2025-05-07T20:33:31.2892273Z self, 2025-05-07T20:33:31.2892354Z T: int, 2025-05-07T20:33:31.2892432Z D: int, 2025-05-07T20:33:31.2892569Z scale_ub: Optional[float], 2025-05-07T20:33:31.2892666Z contiguous: bool, 2025-05-07T20:33:31.2892752Z compiled: bool, 2025-05-07T20:33:31.2892835Z ) -> None: 2025-05-07T20:33:31.2892935Z torch.manual_seed(2025) 2025-05-07T20:33:31.2893050Z 2025-05-07T20:33:31.2893224Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2893305Z 2025-05-07T20:33:31.2893401Z x_sign = torch.sign(x) 2025-05-07T20:33:31.2893529Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.2893623Z x = x_sign * x_clamp 2025-05-07T20:33:31.2893704Z x0 = x[:, :D] 2025-05-07T20:33:31.2893792Z x1 = x[:, D:] 2025-05-07T20:33:31.2893866Z 2025-05-07T20:33:31.2893950Z if contiguous: 2025-05-07T20:33:31.2894046Z x0 = x0.contiguous() 2025-05-07T20:33:31.2894135Z x1 = x1.contiguous() 2025-05-07T20:33:31.2894211Z 2025-05-07T20:33:31.2894308Z if scale_ub is not None: 2025-05-07T20:33:31.2894414Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:31.2894653Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:31.2894780Z ) 2025-05-07T20:33:31.2894860Z else: 2025-05-07T20:33:31.2894958Z scale_ub_tensor = None 2025-05-07T20:33:31.2895036Z 2025-05-07T20:33:31.2895170Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.2895265Z op = silu_mul_quant 2025-05-07T20:33:31.2895356Z if compiled: 2025-05-07T20:33:31.2895459Z op = torch.compile(op) 2025-05-07T20:33:31.2895570Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2895647Z 2025-05-07T20:33:31.2895741Z > y_fp8, y_scale = fn() 2025-05-07T20:33:31.2895746Z 2025-05-07T20:33:31.2895849Z moe/activation_test.py:117: 2025-05-07T20:33:31.2895982Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2896085Z moe/activation_test.py:115: in fn 2025-05-07T20:33:31.2896192Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2896722Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:31.2896832Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:31.2897211Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:31.2897445Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:31.2897808Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:31.2897907Z kernel = self.compile( 2025-05-07T20:33:31.2898310Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:31.2898496Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:31.2898629Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2898634Z 2025-05-07T20:33:31.2898849Z self = 2025-05-07T20:33:31.2899668Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:31.2900187Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08e44dda80>} 2025-05-07T20:33:31.2901032Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:31.2901265Z context = 2025-05-07T20:33:31.2901270Z 2025-05-07T20:33:31.2901448Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:31.2901724Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:31.2901877Z module_map=module_map) 2025-05-07T20:33:31.2902040Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:31.2902143Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:31.2902223Z E ^ 2025-05-07T20:33:31.2902593Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:31.2902601Z 2025-05-07T20:33:31.2903037Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:31.2903042Z 2025-05-07T20:33:31.2903148Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2903381Z self=, 2025-05-07T20:33:31.2903460Z T=2048, 2025-05-07T20:33:31.2903536Z D=5120, 2025-05-07T20:33:31.2903685Z scale_ub=None, 2025-05-07T20:33:31.2903777Z contiguous=True, 2025-05-07T20:33:31.2903864Z compiled=False, 2025-05-07T20:33:31.2903939Z ) 2025-05-07T20:33:31.2904170Z self = 2025-05-07T20:33:31.2904346Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:31.2904350Z 2025-05-07T20:33:31.2904427Z @given( 2025-05-07T20:33:31.2908658Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2908788Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2908912Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2909039Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2909156Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2909244Z ) 2025-05-07T20:33:31.2909503Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2909606Z def test_silu_mul_quant( 2025-05-07T20:33:31.2909694Z self, 2025-05-07T20:33:31.2909781Z T: int, 2025-05-07T20:33:31.2909861Z D: int, 2025-05-07T20:33:31.2909968Z scale_ub: Optional[float], 2025-05-07T20:33:31.2910066Z contiguous: bool, 2025-05-07T20:33:31.2910154Z compiled: bool, 2025-05-07T20:33:31.2910243Z ) -> None: 2025-05-07T20:33:31.2910340Z torch.manual_seed(2025) 2025-05-07T20:33:31.2910412Z 2025-05-07T20:33:31.2910593Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2910669Z 2025-05-07T20:33:31.2910765Z > x_sign = torch.sign(x) 2025-05-07T20:33:31.2912693Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
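Note that the CompilationError fires on compiled=False draws as well: the traceback goes straight from the eager silu_mul_quant into the Triton launch _fbgemm_silu_mul_quant[grid] at activation.py:80, so torch.compile is not what pulls Triton in. If an up-front capability check is not wanted, the failure can instead be translated into a skip at launch time; a hypothetical wrapper, assuming a pytest runner (the exception's import path matches the traceback above, but the wrapper itself is not FBGEMM code):

import pytest
from triton.compiler.errors import CompilationError

def launch_or_skip(launch):
    # Run a zero-argument launch thunk; turn the unsupported-fp8 compile
    # failure seen above into a skip instead of a deep traceback.
    try:
        return launch()
    except CompilationError as err:
        if "fp8e4nv not supported" in str(err):
            pytest.skip("Triton: fp8e4nv unsupported on this GPU")
        raise

Inside fn() this would be called as launch_or_skip(lambda: op(x0, x1, scale_ub_tensor)).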
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:31.2912701Z 2025-05-07T20:33:31.2912822Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:31.2912826Z 2025-05-07T20:33:31.2912929Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2913161Z self=, 2025-05-07T20:33:31.2913314Z T=16384, 2025-05-07T20:33:31.2913396Z D=5120, 2025-05-07T20:33:31.2913484Z scale_ub=None, 2025-05-07T20:33:31.2913580Z contiguous=True, 2025-05-07T20:33:31.2913669Z compiled=False, 2025-05-07T20:33:31.2913748Z ) 2025-05-07T20:33:31.2914022Z self = 2025-05-07T20:33:31.2914207Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:31.2914214Z 2025-05-07T20:33:31.2914302Z @given( 2025-05-07T20:33:31.2914465Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2914569Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2914692Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2914810Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2914926Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2915010Z ) 2025-05-07T20:33:31.2915272Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2915372Z def test_silu_mul_quant( 2025-05-07T20:33:31.2915455Z self, 2025-05-07T20:33:31.2915539Z T: int, 2025-05-07T20:33:31.2915619Z D: int, 2025-05-07T20:33:31.2915727Z scale_ub: Optional[float], 2025-05-07T20:33:31.2915820Z contiguous: bool, 2025-05-07T20:33:31.2915916Z compiled: bool, 2025-05-07T20:33:31.2915998Z ) -> None: 2025-05-07T20:33:31.2916137Z torch.manual_seed(2025) 2025-05-07T20:33:31.2916218Z 2025-05-07T20:33:31.2916391Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2918301Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:31.2918313Z 2025-05-07T20:33:31.2918434Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:31.2918438Z 2025-05-07T20:33:31.2918540Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2918773Z self=, 2025-05-07T20:33:31.2918855Z T=4096, 2025-05-07T20:33:31.2918933Z D=5120, 2025-05-07T20:33:31.2919024Z scale_ub=None, 2025-05-07T20:33:31.2919108Z contiguous=True, 2025-05-07T20:33:31.2919196Z compiled=False, 2025-05-07T20:33:31.2919271Z ) 2025-05-07T20:33:31.2919491Z self = 2025-05-07T20:33:31.2919671Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:31.2919678Z 2025-05-07T20:33:31.2919757Z @given( 2025-05-07T20:33:31.2919875Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2919976Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2920095Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2920210Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2920332Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2920409Z ) 2025-05-07T20:33:31.2920667Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2920763Z def test_silu_mul_quant( 2025-05-07T20:33:31.2920840Z self, 2025-05-07T20:33:31.2920923Z T: int, 2025-05-07T20:33:31.2920999Z D: int, 2025-05-07T20:33:31.2921098Z scale_ub: Optional[float], 2025-05-07T20:33:31.2921191Z contiguous: bool, 2025-05-07T20:33:31.2921278Z compiled: bool, 2025-05-07T20:33:31.2921408Z ) -> None: 2025-05-07T20:33:31.2921511Z torch.manual_seed(2025) 2025-05-07T20:33:31.2921584Z 2025-05-07T20:33:31.2921754Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2923711Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:31.2923754Z 2025-05-07T20:33:31.2923878Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:31.2923882Z 2025-05-07T20:33:31.2923989Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2924225Z self=, 2025-05-07T20:33:31.2924310Z T=2048, 2025-05-07T20:33:31.2924388Z D=5120, 2025-05-07T20:33:31.2924469Z scale_ub=None, 2025-05-07T20:33:31.2924562Z contiguous=False, 2025-05-07T20:33:31.2924653Z compiled=False, 2025-05-07T20:33:31.2924732Z ) 2025-05-07T20:33:31.2924956Z self = 2025-05-07T20:33:31.2925173Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:31.2925180Z 2025-05-07T20:33:31.2925263Z @given( 2025-05-07T20:33:31.2925382Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2925727Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2925886Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2926002Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2926119Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2926203Z ) 2025-05-07T20:33:31.2926457Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2926556Z def test_silu_mul_quant( 2025-05-07T20:33:31.2926635Z self, 2025-05-07T20:33:31.2926712Z T: int, 2025-05-07T20:33:31.2926790Z D: int, 2025-05-07T20:33:31.2926889Z scale_ub: Optional[float], 2025-05-07T20:33:31.2926974Z contiguous: bool, 2025-05-07T20:33:31.2927069Z compiled: bool, 2025-05-07T20:33:31.2927146Z ) -> None: 2025-05-07T20:33:31.2927244Z torch.manual_seed(2025) 2025-05-07T20:33:31.2927325Z 2025-05-07T20:33:31.2927495Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2929403Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:31.2929415Z 2025-05-07T20:33:31.2929529Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:31.2929534Z 2025-05-07T20:33:31.2929635Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2929866Z self=, 2025-05-07T20:33:31.2929945Z T=4096, 2025-05-07T20:33:31.2930027Z D=7168, 2025-05-07T20:33:31.2930117Z scale_ub=None, 2025-05-07T20:33:31.2930205Z contiguous=True, 2025-05-07T20:33:31.2930296Z compiled=True, 2025-05-07T20:33:31.2930371Z ) 2025-05-07T20:33:31.2930594Z self = 2025-05-07T20:33:31.2930856Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:31.2930860Z 2025-05-07T20:33:31.2930936Z @given( 2025-05-07T20:33:31.2931048Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2931148Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2931317Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2931429Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2931544Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2931675Z ) 2025-05-07T20:33:31.2931929Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2932021Z def test_silu_mul_quant( 2025-05-07T20:33:31.2932100Z self, 2025-05-07T20:33:31.2932178Z T: int, 2025-05-07T20:33:31.2932252Z D: int, 2025-05-07T20:33:31.2932349Z scale_ub: Optional[float], 2025-05-07T20:33:31.2932441Z contiguous: bool, 2025-05-07T20:33:31.2932529Z compiled: bool, 2025-05-07T20:33:31.2932604Z ) -> None: 2025-05-07T20:33:31.2932697Z torch.manual_seed(2025) 2025-05-07T20:33:31.2932769Z 2025-05-07T20:33:31.2932936Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2935021Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:31.2935032Z 2025-05-07T20:33:31.2935174Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:31.2935181Z 2025-05-07T20:33:31.2935280Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2935505Z self=, 2025-05-07T20:33:31.2935586Z T=2048, 2025-05-07T20:33:31.2935665Z D=5120, 2025-05-07T20:33:31.2935746Z scale_ub=1200.0, 2025-05-07T20:33:31.2935836Z contiguous=False, 2025-05-07T20:33:31.2935921Z compiled=False, 2025-05-07T20:33:31.2935992Z ) 2025-05-07T20:33:31.2936215Z self = 2025-05-07T20:33:31.2936392Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:31.2936399Z 2025-05-07T20:33:31.2936479Z @given( 2025-05-07T20:33:31.2936592Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2936689Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2936810Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2936926Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2937039Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2937117Z ) 2025-05-07T20:33:31.2937367Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2937459Z def test_silu_mul_quant( 2025-05-07T20:33:31.2937542Z self, 2025-05-07T20:33:31.2937619Z T: int, 2025-05-07T20:33:31.2937693Z D: int, 2025-05-07T20:33:31.2937794Z scale_ub: Optional[float], 2025-05-07T20:33:31.2937884Z contiguous: bool, 2025-05-07T20:33:31.2937967Z compiled: bool, 2025-05-07T20:33:31.2938045Z ) -> None: 2025-05-07T20:33:31.2938139Z torch.manual_seed(2025) 2025-05-07T20:33:31.2938214Z 2025-05-07T20:33:31.2938383Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2940327Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:31.2940374Z 2025-05-07T20:33:31.2940491Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:31.2940496Z 2025-05-07T20:33:31.2940659Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2940889Z self=, 2025-05-07T20:33:31.2940965Z T=4096, 2025-05-07T20:33:31.2941043Z D=7168, 2025-05-07T20:33:31.2941127Z scale_ub=1200.0, 2025-05-07T20:33:31.2941210Z contiguous=True, 2025-05-07T20:33:31.2941297Z compiled=False, 2025-05-07T20:33:31.2941371Z ) 2025-05-07T20:33:31.2941591Z self = 2025-05-07T20:33:31.2941770Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:31.2941775Z 2025-05-07T20:33:31.2941850Z @given( 2025-05-07T20:33:31.2941966Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2942067Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2942179Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2942332Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2942455Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2942528Z ) 2025-05-07T20:33:31.2942783Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2942874Z def test_silu_mul_quant( 2025-05-07T20:33:31.2942949Z self, 2025-05-07T20:33:31.2943028Z T: int, 2025-05-07T20:33:31.2943104Z D: int, 2025-05-07T20:33:31.2943205Z scale_ub: Optional[float], 2025-05-07T20:33:31.2943296Z contiguous: bool, 2025-05-07T20:33:31.2943378Z compiled: bool, 2025-05-07T20:33:31.2943453Z ) -> None: 2025-05-07T20:33:31.2943549Z torch.manual_seed(2025) 2025-05-07T20:33:31.2943619Z 2025-05-07T20:33:31.2943790Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2945704Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:31.2945714Z 2025-05-07T20:33:31.2945834Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:31.2945838Z 2025-05-07T20:33:31.2945936Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2946163Z self=, 2025-05-07T20:33:31.2946247Z T=16384, 2025-05-07T20:33:31.2946326Z D=7168, 2025-05-07T20:33:31.2946408Z scale_ub=None, 2025-05-07T20:33:31.2946497Z contiguous=False, 2025-05-07T20:33:31.2946584Z compiled=True, 2025-05-07T20:33:31.2946660Z ) 2025-05-07T20:33:31.2946884Z self = 2025-05-07T20:33:31.2947066Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:31.2947071Z 2025-05-07T20:33:31.2947151Z @given( 2025-05-07T20:33:31.2947267Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2947365Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2947480Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2947641Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2947752Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2947829Z ) 2025-05-07T20:33:31.2948076Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2948208Z def test_silu_mul_quant( 2025-05-07T20:33:31.2948293Z self, 2025-05-07T20:33:31.2948376Z T: int, 2025-05-07T20:33:31.2948458Z D: int, 2025-05-07T20:33:31.2948566Z scale_ub: Optional[float], 2025-05-07T20:33:31.2948695Z contiguous: bool, 2025-05-07T20:33:31.2948786Z compiled: bool, 2025-05-07T20:33:31.2948861Z ) -> None: 2025-05-07T20:33:31.2948954Z torch.manual_seed(2025) 2025-05-07T20:33:31.2949032Z 2025-05-07T20:33:31.2949201Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2951108Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:31.2951157Z 2025-05-07T20:33:31.2951272Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:31.2951279Z 2025-05-07T20:33:31.2951378Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2951608Z self=, 2025-05-07T20:33:31.2951685Z T=4096, 2025-05-07T20:33:31.2951763Z D=7168, 2025-05-07T20:33:31.2951847Z scale_ub=None, 2025-05-07T20:33:31.2951931Z contiguous=True, 2025-05-07T20:33:31.2952019Z compiled=False, 2025-05-07T20:33:31.2952094Z ) 2025-05-07T20:33:31.2952313Z self = 2025-05-07T20:33:31.2952486Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:31.2952491Z 2025-05-07T20:33:31.2952569Z @given( 2025-05-07T20:33:31.2952684Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2952782Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2952897Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2953011Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2953124Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2953196Z ) 2025-05-07T20:33:31.2953450Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2953541Z def test_silu_mul_quant( 2025-05-07T20:33:31.2953618Z self, 2025-05-07T20:33:31.2953701Z T: int, 2025-05-07T20:33:31.2953779Z D: int, 2025-05-07T20:33:31.2953876Z scale_ub: Optional[float], 2025-05-07T20:33:31.2953966Z contiguous: bool, 2025-05-07T20:33:31.2954049Z compiled: bool, 2025-05-07T20:33:31.2954122Z ) -> None: 2025-05-07T20:33:31.2954221Z torch.manual_seed(2025) 2025-05-07T20:33:31.2954296Z 2025-05-07T20:33:31.2954465Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2956380Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:31.2956432Z 2025-05-07T20:33:31.2956552Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:31.2956556Z 2025-05-07T20:33:31.2956657Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2956921Z self=, 2025-05-07T20:33:31.2957004Z T=16384, 2025-05-07T20:33:31.2957078Z D=7168, 2025-05-07T20:33:31.2957159Z scale_ub=None, 2025-05-07T20:33:31.2957249Z contiguous=True, 2025-05-07T20:33:31.2957331Z compiled=False, 2025-05-07T20:33:31.2957448Z ) 2025-05-07T20:33:31.2957673Z self = 2025-05-07T20:33:31.2957854Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:31.2957858Z 2025-05-07T20:33:31.2957941Z @given( 2025-05-07T20:33:31.2958060Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2958161Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2958282Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2958398Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2958511Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2958593Z ) 2025-05-07T20:33:31.2958847Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2958943Z def test_silu_mul_quant( 2025-05-07T20:33:31.2959027Z self, 2025-05-07T20:33:31.2959148Z T: int, 2025-05-07T20:33:31.2959228Z D: int, 2025-05-07T20:33:31.2959331Z scale_ub: Optional[float], 2025-05-07T20:33:31.2959421Z contiguous: bool, 2025-05-07T20:33:31.2959506Z compiled: bool, 2025-05-07T20:33:31.2959581Z ) -> None: 2025-05-07T20:33:31.2959679Z torch.manual_seed(2025) 2025-05-07T20:33:31.2959755Z 2025-05-07T20:33:31.2959925Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2961840Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:31.2961851Z 2025-05-07T20:33:31.2961965Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:31.2961969Z 2025-05-07T20:33:31.2962070Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2962299Z self=, 2025-05-07T20:33:31.2962376Z T=16384, 2025-05-07T20:33:31.2962453Z D=7168, 2025-05-07T20:33:31.2962538Z scale_ub=1200.0, 2025-05-07T20:33:31.2962621Z contiguous=True, 2025-05-07T20:33:31.2962711Z compiled=False, 2025-05-07T20:33:31.2962786Z ) 2025-05-07T20:33:31.2963005Z self = 2025-05-07T20:33:31.2963187Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:31.2963192Z 2025-05-07T20:33:31.2963267Z @given( 2025-05-07T20:33:31.2963380Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2963480Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2963595Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2963708Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2963823Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2963896Z ) 2025-05-07T20:33:31.2964150Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2964241Z def test_silu_mul_quant( 2025-05-07T20:33:31.2964363Z self, 2025-05-07T20:33:31.2964448Z T: int, 2025-05-07T20:33:31.2964527Z D: int, 2025-05-07T20:33:31.2964625Z scale_ub: Optional[float], 2025-05-07T20:33:31.2964742Z contiguous: bool, 2025-05-07T20:33:31.2964833Z compiled: bool, 2025-05-07T20:33:31.2965039Z ) -> None: 2025-05-07T20:33:31.2965136Z torch.manual_seed(2025) 2025-05-07T20:33:31.2965209Z 2025-05-07T20:33:31.2965379Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2967289Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
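The allocator hint repeated in each of these errors is an environment variable, and it only takes effect if it is in place before the process's first CUDA allocation. A minimal sketch, assuming the test process sets it itself rather than the CI job environment:

import os

# Must be set before torch initializes the CUDA caching allocator, so set it
# before importing torch (or export it in the job environment instead).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # imported only after the variable is set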
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:31.2967339Z 2025-05-07T20:33:31.2967457Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:31.2967461Z 2025-05-07T20:33:31.2967564Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2967795Z self=, 2025-05-07T20:33:31.2967876Z T=128, 2025-05-07T20:33:31.2967956Z D=5120, 2025-05-07T20:33:31.2968080Z scale_ub=1200.0, 2025-05-07T20:33:31.2968177Z contiguous=False, 2025-05-07T20:33:31.2968263Z compiled=False, 2025-05-07T20:33:31.2968340Z ) 2025-05-07T20:33:31.2968566Z self = 2025-05-07T20:33:31.2968742Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:31.2968746Z 2025-05-07T20:33:31.2968829Z @given( 2025-05-07T20:33:31.2968945Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2969047Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2969166Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2969284Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2969403Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2969485Z ) 2025-05-07T20:33:31.2969738Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2969838Z def test_silu_mul_quant( 2025-05-07T20:33:31.2969923Z self, 2025-05-07T20:33:31.2970008Z T: int, 2025-05-07T20:33:31.2970084Z D: int, 2025-05-07T20:33:31.2970188Z scale_ub: Optional[float], 2025-05-07T20:33:31.2970280Z contiguous: bool, 2025-05-07T20:33:31.2970369Z compiled: bool, 2025-05-07T20:33:31.2970447Z ) -> None: 2025-05-07T20:33:31.2970542Z torch.manual_seed(2025) 2025-05-07T20:33:31.2970618Z 2025-05-07T20:33:31.2970787Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2970859Z 2025-05-07T20:33:31.2970950Z x_sign = torch.sign(x) 2025-05-07T20:33:31.2971079Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.2971167Z x = x_sign * x_clamp 2025-05-07T20:33:31.2971249Z x0 = x[:, :D] 2025-05-07T20:33:31.2971329Z x1 = x[:, D:] 2025-05-07T20:33:31.2971404Z 2025-05-07T20:33:31.2971490Z if contiguous: 2025-05-07T20:33:31.2971582Z x0 = x0.contiguous() 2025-05-07T20:33:31.2971676Z x1 = x1.contiguous() 2025-05-07T20:33:31.2971752Z 2025-05-07T20:33:31.2971842Z if scale_ub is not None: 2025-05-07T20:33:31.2971947Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:31.2972080Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:31.2972155Z ) 2025-05-07T20:33:31.2972233Z else: 2025-05-07T20:33:31.2972329Z scale_ub_tensor = None 2025-05-07T20:33:31.2972450Z 2025-05-07T20:33:31.2972580Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.2972669Z op = silu_mul_quant 2025-05-07T20:33:31.2972753Z if compiled: 2025-05-07T20:33:31.2972849Z op = torch.compile(op) 2025-05-07T20:33:31.2973015Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2973087Z 2025-05-07T20:33:31.2973176Z > y_fp8, y_scale = fn() 2025-05-07T20:33:31.2973180Z 2025-05-07T20:33:31.2973275Z moe/activation_test.py:117: 2025-05-07T20:33:31.2973450Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2973549Z moe/activation_test.py:115: in fn 2025-05-07T20:33:31.2973647Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2974175Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:31.2974273Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:31.2974711Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:31.2974939Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:31.2975301Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:31.2975396Z kernel = self.compile( 2025-05-07T20:33:31.2975849Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:31.2976035Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:31.2976165Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2976169Z 2025-05-07T20:33:31.2976376Z self = 2025-05-07T20:33:31.2977189Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:31.2977711Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0527f147c0>} 2025-05-07T20:33:31.2978510Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:31.2978707Z context = 2025-05-07T20:33:31.2978712Z 2025-05-07T20:33:31.2978880Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:31.2980415Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:31.2980523Z module_map=module_map) 2025-05-07T20:33:31.2980687Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:31.2980784Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:31.2980863Z E ^ 2025-05-07T20:33:31.2981240Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:31.2981245Z 2025-05-07T20:33:31.2981680Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:31.2981686Z 2025-05-07T20:33:31.2981791Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2982018Z self=, 2025-05-07T20:33:31.2982095Z T=2048, 2025-05-07T20:33:31.2982170Z D=7168, 2025-05-07T20:33:31.2982247Z scale_ub=None, 2025-05-07T20:33:31.2982335Z contiguous=False, 2025-05-07T20:33:31.2982421Z compiled=False, 2025-05-07T20:33:31.2982545Z ) 2025-05-07T20:33:31.2982770Z self = 2025-05-07T20:33:31.2982952Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:31.2982956Z 2025-05-07T20:33:31.2983038Z @given( 2025-05-07T20:33:31.2983202Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2983302Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2983417Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2983536Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2983688Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2983763Z ) 2025-05-07T20:33:31.2984017Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2984110Z def test_silu_mul_quant( 2025-05-07T20:33:31.2984185Z self, 2025-05-07T20:33:31.2984265Z T: int, 2025-05-07T20:33:31.2984345Z D: int, 2025-05-07T20:33:31.2984445Z scale_ub: Optional[float], 2025-05-07T20:33:31.2984532Z contiguous: bool, 2025-05-07T20:33:31.2984616Z compiled: bool, 2025-05-07T20:33:31.2984695Z ) -> None: 2025-05-07T20:33:31.2984790Z torch.manual_seed(2025) 2025-05-07T20:33:31.2984862Z 2025-05-07T20:33:31.2985040Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2986991Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
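The CompilationError above is a different failure from the OOMs: Triton rejects the fp8e4nv (e4m3) dtype on this GPU, which appears to be a compute-capability limit. fp8e4nv kernels need SM 8.9 (Ada) or newer, while the A10G on this g5.4xlarge runner reports SM 8.6, leaving only the fp8e4b15/fp8e5 variants named in the message. A hedged sketch of the kind of guard a test could apply (the helper name is illustrative, not FBGEMM API):

import torch

def cuda_supports_fp8e4nv() -> bool:
    # fp8e4nv (e4m3) requires compute capability >= (8, 9); the A10G in
    # this job reports (8, 6) and raises the ValueError shown above.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)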
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:31.2987002Z 2025-05-07T20:33:31.2987125Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:31.2987129Z 2025-05-07T20:33:31.2987230Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.2987457Z self=, 2025-05-07T20:33:31.2987542Z T=128, 2025-05-07T20:33:31.2987617Z D=7168, 2025-05-07T20:33:31.2987702Z scale_ub=1200.0, 2025-05-07T20:33:31.2987787Z contiguous=True, 2025-05-07T20:33:31.2987872Z compiled=True, 2025-05-07T20:33:31.2987949Z ) 2025-05-07T20:33:31.2988171Z self = 2025-05-07T20:33:31.2988341Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:31.2988346Z 2025-05-07T20:33:31.2988424Z @given( 2025-05-07T20:33:31.2988539Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.2988636Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.2988753Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.2988866Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.2988979Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.2989049Z ) 2025-05-07T20:33:31.2989301Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.2989397Z def test_silu_mul_quant( 2025-05-07T20:33:31.2989471Z self, 2025-05-07T20:33:31.2989544Z T: int, 2025-05-07T20:33:31.2989626Z D: int, 2025-05-07T20:33:31.2989722Z scale_ub: Optional[float], 2025-05-07T20:33:31.2989809Z contiguous: bool, 2025-05-07T20:33:31.2989897Z compiled: bool, 2025-05-07T20:33:31.2989972Z ) -> None: 2025-05-07T20:33:31.2990063Z torch.manual_seed(2025) 2025-05-07T20:33:31.2990138Z 2025-05-07T20:33:31.2990307Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.2990379Z 2025-05-07T20:33:31.2990522Z x_sign = torch.sign(x) 2025-05-07T20:33:31.2990646Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.2990735Z x = x_sign * x_clamp 2025-05-07T20:33:31.2990813Z x0 = x[:, :D] 2025-05-07T20:33:31.2990892Z x1 = x[:, D:] 2025-05-07T20:33:31.2990968Z 2025-05-07T20:33:31.2991090Z if contiguous: 2025-05-07T20:33:31.2991181Z x0 = x0.contiguous() 2025-05-07T20:33:31.2991275Z x1 = x1.contiguous() 2025-05-07T20:33:31.2991348Z 2025-05-07T20:33:31.2991438Z if scale_ub is not None: 2025-05-07T20:33:31.2991586Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:31.2991721Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:31.2991799Z ) 2025-05-07T20:33:31.2991880Z else: 2025-05-07T20:33:31.2991976Z scale_ub_tensor = None 2025-05-07T20:33:31.2992052Z 2025-05-07T20:33:31.2992179Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:31.2992269Z op = silu_mul_quant 2025-05-07T20:33:31.2992355Z if compiled: 2025-05-07T20:33:31.2992454Z op = torch.compile(op) 2025-05-07T20:33:31.2992558Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2992635Z 2025-05-07T20:33:31.2992726Z > y_fp8, y_scale = fn() 2025-05-07T20:33:31.2992731Z 2025-05-07T20:33:31.2992824Z moe/activation_test.py:117: 2025-05-07T20:33:31.2993002Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2993102Z moe/activation_test.py:115: in fn 2025-05-07T20:33:31.2993208Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:31.2993597Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:31.2993687Z return fn(*args, **kwargs) 
2025-05-07T20:33:31.2994214Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:31.2994313Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:31.2994689Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:31.2994927Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:31.2995284Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:31.2995382Z kernel = self.compile( 2025-05-07T20:33:31.2995783Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:31.2995965Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:31.2996102Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:31.2996107Z 2025-05-07T20:33:31.2996314Z self = 2025-05-07T20:33:31.2997130Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:31.2997647Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0527f15940>} 2025-05-07T20:33:31.2998439Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:31.2998640Z context = 2025-05-07T20:33:31.2998645Z 2025-05-07T20:33:31.2998811Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:31.2999086Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:31.2999238Z module_map=module_map) 2025-05-07T20:33:31.2999398Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:31.2999500Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:31.2999576Z E ^ 2025-05-07T20:33:31.2999986Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:31.2999995Z 2025-05-07T20:33:31.3000434Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:31.3000475Z 2025-05-07T20:33:31.3000578Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.3000809Z self=, 2025-05-07T20:33:31.3000889Z T=128, 2025-05-07T20:33:31.3000961Z D=7168, 2025-05-07T20:33:31.3001047Z scale_ub=1200.0, 2025-05-07T20:33:31.3001134Z contiguous=True, 2025-05-07T20:33:31.3001213Z compiled=False, 2025-05-07T20:33:31.3001293Z ) 2025-05-07T20:33:31.3001515Z self = 2025-05-07T20:33:31.3001688Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:31.3001693Z 2025-05-07T20:33:31.3001773Z @given( 2025-05-07T20:33:31.3001887Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.3002028Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.3002143Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.3002261Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.3002374Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.3002447Z ) 2025-05-07T20:33:31.3002699Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.3002789Z def test_silu_mul_quant( 2025-05-07T20:33:31.3002862Z self, 2025-05-07T20:33:31.3002944Z T: int, 2025-05-07T20:33:31.3003019Z D: int, 2025-05-07T20:33:31.3003115Z scale_ub: Optional[float], 2025-05-07T20:33:31.3003204Z contiguous: bool, 2025-05-07T20:33:31.3003285Z compiled: bool, 2025-05-07T20:33:31.3003360Z ) -> None: 2025-05-07T20:33:31.3003458Z torch.manual_seed(2025) 2025-05-07T20:33:31.3003533Z 2025-05-07T20:33:31.3003704Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.3003785Z 2025-05-07T20:33:31.3003876Z x_sign = torch.sign(x) 2025-05-07T20:33:31.3004000Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.3005911Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:31.3005919Z 2025-05-07T20:33:31.3006040Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:31.3006044Z 2025-05-07T20:33:31.3006145Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.3006377Z self=, 2025-05-07T20:33:31.3006456Z T=128, 2025-05-07T20:33:31.3006540Z D=5120, 2025-05-07T20:33:31.3006621Z scale_ub=1200.0, 2025-05-07T20:33:31.3006709Z contiguous=True, 2025-05-07T20:33:31.3006792Z compiled=True, 2025-05-07T20:33:31.3006869Z ) 2025-05-07T20:33:31.3007096Z self = 2025-05-07T20:33:31.3007263Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:31.3007336Z 2025-05-07T20:33:31.3007422Z @given( 2025-05-07T20:33:31.3007540Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.3007640Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.3007759Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.3007915Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.3008029Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.3008105Z ) 2025-05-07T20:33:31.3008358Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.3008490Z def test_silu_mul_quant( 2025-05-07T20:33:31.3008574Z self, 2025-05-07T20:33:31.3008652Z T: int, 2025-05-07T20:33:31.3008736Z D: int, 2025-05-07T20:33:31.3008832Z scale_ub: Optional[float], 2025-05-07T20:33:31.3008917Z contiguous: bool, 2025-05-07T20:33:31.3009003Z compiled: bool, 2025-05-07T20:33:31.3009079Z ) -> None: 2025-05-07T20:33:31.3009176Z torch.manual_seed(2025) 2025-05-07T20:33:31.3009253Z 2025-05-07T20:33:31.3009418Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.3009491Z 2025-05-07T20:33:31.3009585Z x_sign = torch.sign(x) 2025-05-07T20:33:31.3009708Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:31.3011652Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
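Note how the failure point has moved from the torch.randn at activation_test.py:92 to the torch.clamp at line 95, and free memory has shrunk from 26.44 MiB to 4.44 MiB across examples: allocations are accumulating between Hypothesis examples rather than any single example being too large. One common mitigation, sketched here only as an assumption about what might help, is to collect dead tensors and release cached blocks between examples:

import gc
import torch

def free_between_examples() -> None:
    # Hypothetical cleanup between Hypothesis examples: reclaim dead tensors,
    # then hand the allocator's cached blocks back to the driver so a small
    # later case (e.g. T=128) is not starved by earlier large ones.
    gc.collect()
    torch.cuda.empty_cache()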
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:31.3011662Z 2025-05-07T20:33:31.3011779Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:31.3011784Z 2025-05-07T20:33:31.3011883Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:31.3012112Z self=, 2025-05-07T20:33:31.3012189Z T=128, 2025-05-07T20:33:31.3012271Z D=7168, 2025-05-07T20:33:31.3012352Z scale_ub=None, 2025-05-07T20:33:31.3012434Z contiguous=True, 2025-05-07T20:33:31.3012518Z compiled=True, 2025-05-07T20:33:31.3012595Z ) 2025-05-07T20:33:31.3012816Z self = 2025-05-07T20:33:31.3012987Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:31.3012992Z 2025-05-07T20:33:31.3013073Z @given( 2025-05-07T20:33:31.3013191Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:31.3013295Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:31.3013409Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:31.3013532Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:31.3013645Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:31.3013719Z ) 2025-05-07T20:33:31.3013972Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:31.3014072Z def test_silu_mul_quant( 2025-05-07T20:33:31.3014152Z self, 2025-05-07T20:33:31.3014235Z T: int, 2025-05-07T20:33:31.3014313Z D: int, 2025-05-07T20:33:31.3014417Z scale_ub: Optional[float], 2025-05-07T20:33:31.3014573Z contiguous: bool, 2025-05-07T20:33:31.3014664Z compiled: bool, 2025-05-07T20:33:31.3014739Z ) -> None: 2025-05-07T20:33:31.3014838Z torch.manual_seed(2025) 2025-05-07T20:33:31.3014928Z 2025-05-07T20:33:31.3015123Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:31.3017066Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:31.3017111Z 2025-05-07T20:33:31.3017231Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:31.3017404Z =============================== warnings summary =============================== 2025-05-07T20:33:31.3017724Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:31.3018035Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:31.3018345Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:31.3019283Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:33:31.3019515Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:33:31.3019560Z 2025-05-07T20:33:31.3019778Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:33:31.3019951Z ================= 1 failed, 1 deselected, 3 warnings in 13.45s ================= 2025-05-07T20:33:33.0122728Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:33:33.0797047Z [EXEC] [ATTEMPT 1/2] Command attempt failed. 2025-05-07T20:33:33.0797500Z 2025-05-07T20:33:35.0816550Z [EXEC] [ATTEMPT 2/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:33:37.2571166Z ============================= test session starts ============================== 2025-05-07T20:33:37.2572453Z platform linux -- Python 3.12.2, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:33:37.2573542Z cachedir: .pytest_cache 2025-05-07T20:33:37.2574876Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:33:37.2576166Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:33:37.2576608Z plugins: hypothesis-6.131.14 2025-05-07T20:33:38.8858267Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:33:38.9946899Z collecting ... collected 2 items / 1 deselected / 1 selected 2025-05-07T20:33:38.9947486Z run-last-failure: rerun previous 1 failure 2025-05-07T20:33:38.9947799Z 2025-05-07T20:33:41.3916935Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.3918566Z self=, 2025-05-07T20:33:41.3919428Z T=1, 2025-05-07T20:33:41.3919805Z D=5120, 2025-05-07T20:33:41.3920194Z scale_ub=None, 2025-05-07T20:33:41.3920632Z contiguous=True, 2025-05-07T20:33:41.3921070Z compiled=True, 2025-05-07T20:33:41.3921469Z ) 2025-05-07T20:33:41.3922114Z self = 2025-05-07T20:33:41.3923102Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:41.3923637Z 2025-05-07T20:33:41.3923795Z @given( 2025-05-07T20:33:41.3924265Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:41.3933588Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:41.3933966Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:41.3934318Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:41.3934960Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:41.3935270Z ) 2025-05-07T20:33:41.3935642Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:41.3936110Z def test_silu_mul_quant( 2025-05-07T20:33:41.3936369Z self, 2025-05-07T20:33:41.3936668Z T: int, 2025-05-07T20:33:41.3936859Z D: int, 2025-05-07T20:33:41.3937107Z scale_ub: Optional[float], 2025-05-07T20:33:41.3937416Z contiguous: bool, 2025-05-07T20:33:41.3937658Z compiled: bool, 2025-05-07T20:33:41.3937894Z ) -> None: 2025-05-07T20:33:41.3938121Z torch.manual_seed(2025) 2025-05-07T20:33:41.3938364Z 2025-05-07T20:33:41.3938650Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:41.3939015Z 2025-05-07T20:33:41.3939207Z x_sign = torch.sign(x) 2025-05-07T20:33:41.3939508Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 
2025-05-07T20:33:41.3939827Z x = x_sign * x_clamp 2025-05-07T20:33:41.3940070Z x0 = x[:, :D] 2025-05-07T20:33:41.3940289Z x1 = x[:, D:] 2025-05-07T20:33:41.3940504Z 2025-05-07T20:33:41.3940689Z if contiguous: 2025-05-07T20:33:41.3941024Z x0 = x0.contiguous() 2025-05-07T20:33:41.3941300Z x1 = x1.contiguous() 2025-05-07T20:33:41.3941558Z 2025-05-07T20:33:41.3941755Z if scale_ub is not None: 2025-05-07T20:33:41.3942050Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:41.3942403Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:41.3942719Z ) 2025-05-07T20:33:41.3942921Z else: 2025-05-07T20:33:41.3943141Z scale_ub_tensor = None 2025-05-07T20:33:41.3943400Z 2025-05-07T20:33:41.3943641Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:41.3943967Z op = silu_mul_quant 2025-05-07T20:33:41.3944213Z if compiled: 2025-05-07T20:33:41.3944464Z op = torch.compile(op) 2025-05-07T20:33:41.3944779Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:41.3945060Z 2025-05-07T20:33:41.3945256Z y_fp8, y_scale = fn() 2025-05-07T20:33:41.3945548Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:41.3945846Z 2025-05-07T20:33:41.3946083Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:41.3946433Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:41.3946742Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:41.3947084Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:41.3947485Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:41.3947811Z 2025-05-07T20:33:41.3948011Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:41.3948222Z 2025-05-07T20:33:41.3948325Z moe/activation_test.py:126: 2025-05-07T20:33:41.3948635Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.3948984Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:41.3949324Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:41.3950161Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:41.3950962Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:41.3951529Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:41.3952248Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:41.3952978Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:41.3953880Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:41.3954643Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:41.3955363Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:41.3956007Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:41.3956547Z fn() 2025-05-07T20:33:41.3957125Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:41.3957747Z self.fn.run( 2025-05-07T20:33:41.3958234Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:41.3958785Z kernel = self.compile( 2025-05-07T20:33:41.3959353Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:41.3960043Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:41.3960451Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:41.3960700Z 2025-05-07T20:33:41.3960914Z self = 2025-05-07T20:33:41.3962094Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:41.3963561Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c0b065c60>} 2025-05-07T20:33:41.3964981Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:41.3966076Z context = 2025-05-07T20:33:41.3966389Z 2025-05-07T20:33:41.3966566Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:41.3967118Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:41.3967612Z module_map=module_map) 2025-05-07T20:33:41.3967990Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:41.3968366Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:41.3968652Z E ^ 2025-05-07T20:33:41.3969130Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:41.3969611Z 2025-05-07T20:33:41.3970051Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:41.3970606Z 2025-05-07T20:33:41.3970715Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:41.3971148Z self=, 2025-05-07T20:33:41.3971565Z T=2048, 2025-05-07T20:33:41.3971766Z D=5120, 2025-05-07T20:33:41.3971970Z scale_ub=1200.0, 2025-05-07T20:33:41.3972197Z contiguous=True, 2025-05-07T20:33:41.3972434Z compiled=False, 2025-05-07T20:33:41.3972670Z ) 2025-05-07T20:33:42.1328763Z self = 2025-05-07T20:33:42.1330417Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:42.1331201Z 2025-05-07T20:33:42.1331422Z @given( 2025-05-07T20:33:42.1331999Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:42.1332647Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:42.1333257Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:42.1334345Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:42.1335174Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:42.1335740Z ) 2025-05-07T20:33:42.1336578Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:42.1337383Z def test_silu_mul_quant( 2025-05-07T20:33:42.1337678Z self, 2025-05-07T20:33:42.1337870Z T: int, 2025-05-07T20:33:42.1338084Z D: int, 2025-05-07T20:33:42.1338310Z scale_ub: Optional[float], 2025-05-07T20:33:42.1338673Z contiguous: bool, 2025-05-07T20:33:42.1338920Z compiled: bool, 2025-05-07T20:33:42.1339149Z ) -> None: 2025-05-07T20:33:42.1339366Z torch.manual_seed(2025) 2025-05-07T20:33:42.1339614Z 2025-05-07T20:33:42.1339894Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:42.1340244Z 2025-05-07T20:33:42.1340446Z x_sign = torch.sign(x) 2025-05-07T20:33:42.1340753Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:42.1341064Z x = x_sign * x_clamp 2025-05-07T20:33:42.1341317Z x0 = x[:, :D] 
2025-05-07T20:33:42.1341536Z x1 = x[:, D:] 2025-05-07T20:33:42.1341747Z 2025-05-07T20:33:42.1341950Z if contiguous: 2025-05-07T20:33:42.1342191Z x0 = x0.contiguous() 2025-05-07T20:33:42.1342450Z x1 = x1.contiguous() 2025-05-07T20:33:42.1342703Z 2025-05-07T20:33:42.1342981Z if scale_ub is not None: 2025-05-07T20:33:42.1343257Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:42.1343598Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:42.1343914Z ) 2025-05-07T20:33:42.1344112Z else: 2025-05-07T20:33:42.1344319Z scale_ub_tensor = None 2025-05-07T20:33:42.1344573Z 2025-05-07T20:33:42.1344810Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:42.1345128Z op = silu_mul_quant 2025-05-07T20:33:42.1345384Z if compiled: 2025-05-07T20:33:42.1345637Z op = torch.compile(op) 2025-05-07T20:33:42.1345937Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:42.1346221Z 2025-05-07T20:33:42.1346423Z > y_fp8, y_scale = fn() 2025-05-07T20:33:42.1346593Z 2025-05-07T20:33:42.1346695Z moe/activation_test.py:117: 2025-05-07T20:33:42.1347006Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:42.1347362Z moe/activation_test.py:115: in fn 2025-05-07T20:33:42.1347660Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:42.1348381Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:42.1349117Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:42.1349686Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:42.1350407Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:42.1351105Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:42.1351670Z kernel = self.compile( 2025-05-07T20:33:42.1352249Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:42.1352937Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:42.1353353Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:42.1353595Z 2025-05-07T20:33:42.1353819Z self = 2025-05-07T20:33:42.1354947Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:42.1356450Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c0aebc220>} 2025-05-07T20:33:42.1357974Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:42.1359084Z context = 2025-05-07T20:33:42.1359428Z 2025-05-07T20:33:42.1359615Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:42.1360165Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:42.1360665Z module_map=module_map) 2025-05-07T20:33:42.1361046Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:42.1361429Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:42.1361695Z E ^ 2025-05-07T20:33:42.1362185Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:42.1362661Z 2025-05-07T20:33:42.1363116Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:42.1363664Z 2025-05-07T20:33:42.1363779Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:42.1364289Z self=, 2025-05-07T20:33:42.1364728Z T=2048, 2025-05-07T20:33:42.1364932Z D=5120, 2025-05-07T20:33:42.1365137Z scale_ub=1200.0, 2025-05-07T20:33:42.1365373Z contiguous=True, 2025-05-07T20:33:42.1365610Z compiled=True, 2025-05-07T20:33:42.1365822Z ) 2025-05-07T20:33:42.1366164Z self = 2025-05-07T20:33:42.1366688Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:42.1366976Z 2025-05-07T20:33:42.1367060Z @given( 2025-05-07T20:33:42.1367300Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:42.1367629Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:42.1367951Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:42.1368291Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:42.1368648Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:42.1368947Z ) 2025-05-07T20:33:42.1369309Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:42.1369790Z def test_silu_mul_quant( 2025-05-07T20:33:42.1370034Z self, 2025-05-07T20:33:42.1370239Z T: int, 2025-05-07T20:33:42.1370438Z D: int, 2025-05-07T20:33:42.1370661Z scale_ub: Optional[float], 2025-05-07T20:33:42.1370944Z contiguous: bool, 2025-05-07T20:33:42.1371197Z compiled: bool, 2025-05-07T20:33:42.1371417Z ) -> None: 2025-05-07T20:33:42.1371631Z torch.manual_seed(2025) 2025-05-07T20:33:42.1371873Z 2025-05-07T20:33:42.1372144Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:42.1372498Z 2025-05-07T20:33:42.1372699Z x_sign = torch.sign(x) 2025-05-07T20:33:42.1372984Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:42.1373303Z x = x_sign * x_clamp 2025-05-07T20:33:42.1373547Z x0 = x[:, :D] 2025-05-07T20:33:42.1373760Z x1 = x[:, D:] 2025-05-07T20:33:42.1373972Z 2025-05-07T20:33:42.1374161Z if contiguous: 2025-05-07T20:33:42.1374387Z x0 = x0.contiguous() 2025-05-07T20:33:42.1374738Z x1 = x1.contiguous() 2025-05-07T20:33:42.1374985Z 2025-05-07T20:33:42.1375186Z if scale_ub is not None: 2025-05-07T20:33:42.1375460Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:42.1375804Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:42.1376172Z ) 2025-05-07T20:33:42.1376360Z else: 2025-05-07T20:33:42.1376569Z scale_ub_tensor = None 2025-05-07T20:33:42.1376824Z 2025-05-07T20:33:42.1377047Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:42.1377414Z op = silu_mul_quant 2025-05-07T20:33:42.1377672Z if compiled: 2025-05-07T20:33:42.1377915Z op = torch.compile(op) 2025-05-07T20:33:42.1378218Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:42.1378542Z 2025-05-07T20:33:42.1378727Z y_fp8, y_scale = fn() 2025-05-07T20:33:42.1379013Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:42.1379311Z 2025-05-07T20:33:42.1379546Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:42.1379890Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:42.1380191Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:42.1380516Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:42.1380875Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:42.1381191Z 2025-05-07T20:33:42.1381396Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:42.1381593Z 2025-05-07T20:33:42.1381693Z moe/activation_test.py:126: 2025-05-07T20:33:42.1381994Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:42.1382386Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:42.1382712Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:42.1383537Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:42.1384333Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:42.1384909Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:42.1385624Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:42.1386352Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:42.1387116Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:42.1387945Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:42.1388617Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:42.1389250Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:42.1389800Z fn() 2025-05-07T20:33:42.1390331Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:42.1390950Z self.fn.run( 2025-05-07T20:33:42.1391445Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:42.1392005Z kernel = self.compile( 2025-05-07T20:33:42.1392563Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:42.1393259Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:42.1393679Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:42.1393922Z 2025-05-07T20:33:42.1394147Z self = 2025-05-07T20:33:42.1395271Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:42.1396696Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c0aebd8a0>} 2025-05-07T20:33:42.1398200Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:42.1399293Z context = 2025-05-07T20:33:42.1399595Z 2025-05-07T20:33:42.1399767Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:42.1400317Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:42.1400845Z module_map=module_map) 2025-05-07T20:33:42.1401225Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:42.1401592Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:42.1401876Z E ^ 2025-05-07T20:33:42.1402358Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:42.1402833Z 2025-05-07T20:33:42.1403269Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:42.1403818Z 2025-05-07T20:33:42.1403925Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:42.1404349Z self=, 2025-05-07T20:33:42.1404765Z T=16384, 2025-05-07T20:33:42.1405001Z D=7168, 2025-05-07T20:33:42.1405201Z scale_ub=1200.0, 2025-05-07T20:33:42.1405430Z contiguous=False, 2025-05-07T20:33:42.1405651Z compiled=False, 2025-05-07T20:33:42.1405858Z ) 2025-05-07T20:33:42.8839355Z self = 2025-05-07T20:33:42.8840221Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:42.8840661Z 2025-05-07T20:33:42.8840753Z @given( 2025-05-07T20:33:42.8841034Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:42.8841350Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:42.8841674Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:42.8842025Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:42.8842382Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:42.8842682Z ) 2025-05-07T20:33:42.8843057Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:42.8843530Z def test_silu_mul_quant( 2025-05-07T20:33:42.8843787Z self, 2025-05-07T20:33:42.8843993Z T: int, 2025-05-07T20:33:42.8844201Z D: int, 2025-05-07T20:33:42.8844425Z scale_ub: Optional[float], 2025-05-07T20:33:42.8844712Z contiguous: bool, 2025-05-07T20:33:42.8844960Z compiled: bool, 2025-05-07T20:33:42.8845190Z ) -> None: 2025-05-07T20:33:42.8845409Z torch.manual_seed(2025) 2025-05-07T20:33:42.8845653Z 2025-05-07T20:33:42.8845924Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:42.8846279Z 2025-05-07T20:33:42.8846482Z x_sign = torch.sign(x) 2025-05-07T20:33:42.8846770Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:42.8847088Z x = x_sign * x_clamp 2025-05-07T20:33:42.8847326Z x0 = x[:, :D] 2025-05-07T20:33:42.8847540Z x1 = x[:, D:] 2025-05-07T20:33:42.8847741Z 2025-05-07T20:33:42.8847941Z if contiguous: 2025-05-07T20:33:42.8848211Z x0 = x0.contiguous() 2025-05-07T20:33:42.8848472Z x1 = x1.contiguous() 2025-05-07T20:33:42.8848713Z 2025-05-07T20:33:42.8848908Z if scale_ub is not None: 2025-05-07T20:33:42.8849175Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:42.8849515Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:42.8849829Z ) 2025-05-07T20:33:42.8850016Z else: 2025-05-07T20:33:42.8850544Z scale_ub_tensor = None 2025-05-07T20:33:42.8850809Z 2025-05-07T20:33:42.8851036Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:42.8851355Z op = silu_mul_quant 2025-05-07T20:33:42.8851605Z if compiled: 2025-05-07T20:33:42.8851935Z op = torch.compile(op) 2025-05-07T20:33:42.8852246Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:42.8852533Z 2025-05-07T20:33:42.8852732Z > y_fp8, y_scale = fn() 2025-05-07T20:33:42.8852902Z 2025-05-07T20:33:42.8853007Z moe/activation_test.py:117: 2025-05-07T20:33:42.8853395Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:42.8853747Z moe/activation_test.py:115: in fn 2025-05-07T20:33:42.8854035Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:42.8854881Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:42.8855626Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:42.8856188Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:42.8856915Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:42.8857622Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:42.8858241Z kernel = self.compile( 2025-05-07T20:33:42.8858889Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:42.8859593Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:42.8860008Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:42.8860252Z 2025-05-07T20:33:42.8860475Z self = 2025-05-07T20:33:42.8861605Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:42.8863061Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c09d487c0>} 2025-05-07T20:33:42.8864479Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:42.8865576Z context = 2025-05-07T20:33:42.8865879Z 2025-05-07T20:33:42.8866058Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:42.8866595Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:42.8867089Z module_map=module_map) 2025-05-07T20:33:42.8867465Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:42.8867823Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:42.8868096Z E ^ 2025-05-07T20:33:42.8868583Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:42.8869055Z 2025-05-07T20:33:42.8869502Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:42.8870046Z 2025-05-07T20:33:42.8870152Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:42.8870584Z self=, 2025-05-07T20:33:42.8871006Z T=1, 2025-05-07T20:33:42.8871188Z D=7168, 2025-05-07T20:33:42.8871383Z scale_ub=None, 2025-05-07T20:33:42.8871597Z contiguous=True, 2025-05-07T20:33:42.8871875Z compiled=True, 2025-05-07T20:33:42.8872087Z ) 2025-05-07T20:33:42.8872420Z self = 2025-05-07T20:33:42.8872918Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:42.8873196Z 2025-05-07T20:33:42.8873277Z @given( 2025-05-07T20:33:42.8873576Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:42.8873895Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:42.8874204Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:42.8874605Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:42.8874941Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:42.8875232Z ) 2025-05-07T20:33:42.8875588Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:42.8876049Z def test_silu_mul_quant( 2025-05-07T20:33:42.8876294Z self, 2025-05-07T20:33:42.8876493Z T: int, 2025-05-07T20:33:42.8876697Z D: int, 2025-05-07T20:33:42.8876913Z scale_ub: Optional[float], 2025-05-07T20:33:42.8877192Z contiguous: bool, 2025-05-07T20:33:42.8877453Z compiled: bool, 2025-05-07T20:33:42.8885126Z ) -> None: 2025-05-07T20:33:42.8885377Z torch.manual_seed(2025) 2025-05-07T20:33:42.8885635Z 2025-05-07T20:33:42.8885927Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:42.8886297Z 2025-05-07T20:33:42.8886579Z x_sign = torch.sign(x) 2025-05-07T20:33:42.8886887Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:42.8887220Z x = x_sign * x_clamp 2025-05-07T20:33:42.8887480Z x0 = x[:, :D] 2025-05-07T20:33:42.8887749Z x1 = x[:, D:] 2025-05-07T20:33:42.8887968Z 2025-05-07T20:33:42.8888158Z if contiguous: 2025-05-07T20:33:42.8888408Z x0 = x0.contiguous() 2025-05-07T20:33:42.8888685Z x1 = x1.contiguous() 2025-05-07T20:33:42.8888943Z 2025-05-07T20:33:42.8889139Z if scale_ub is not None: 2025-05-07T20:33:42.8889428Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:42.8889780Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:42.8890094Z ) 2025-05-07T20:33:42.8890304Z else: 2025-05-07T20:33:42.8890526Z scale_ub_tensor = None 2025-05-07T20:33:42.8890785Z 2025-05-07T20:33:42.8891028Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:42.8891364Z op = silu_mul_quant 2025-05-07T20:33:42.8891622Z if compiled: 2025-05-07T20:33:42.8891882Z op = torch.compile(op) 2025-05-07T20:33:42.8892198Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:42.8892479Z 2025-05-07T20:33:42.8892684Z y_fp8, y_scale = fn() 2025-05-07T20:33:42.8892984Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:42.8893288Z 2025-05-07T20:33:42.8893528Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:42.8893883Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:42.8894191Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:42.8894510Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:42.8894962Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:42.8895326Z 2025-05-07T20:33:42.8895603Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:42.8895882Z 2025-05-07T20:33:42.8895992Z moe/activation_test.py:126: 2025-05-07T20:33:42.8896301Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:42.8896650Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:42.8896980Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:42.8897804Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:42.8898712Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:42.8899275Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:42.8899989Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:42.8900750Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:42.8901524Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:42.8902327Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:42.8903003Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:42.8903638Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:42.8904191Z fn() 2025-05-07T20:33:42.8904723Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:42.8905355Z self.fn.run( 2025-05-07T20:33:42.8905853Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:42.8906415Z kernel = self.compile( 2025-05-07T20:33:42.8906990Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:42.8907730Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:42.8908143Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:42.8908398Z 2025-05-07T20:33:42.8908612Z self = 2025-05-07T20:33:42.8909747Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:42.8911186Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c09d7a840>} 2025-05-07T20:33:42.8912609Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:42.8913695Z context = 2025-05-07T20:33:42.8914007Z 2025-05-07T20:33:42.8914180Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:42.8914735Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:42.8915227Z module_map=module_map) 2025-05-07T20:33:42.8915599Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:42.8915979Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:42.8916268Z E ^ 2025-05-07T20:33:42.8916746Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:42.8917227Z 2025-05-07T20:33:42.8917669Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:42.8918331Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False) -> CompilationError compiling _fbgemm_silu_mul_quant (moe/activation_test.py:117, fn -> silu_mul_quant, activation.py:80): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
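Every failing example in this run dies at the same point: Triton rejects the fp8e4nv (float8 e4m3) dtype while lowering the kernel to TTIR. In this Triton build, e4m3 lowering needs an sm_89+ GPU (Ada/Hopper); the g5.4xlarge runner carries an A10G (sm_86), where only fp8e4b15 and fp8e5 are available, exactly as the ValueError reports. A minimal sketch of a capability check such tests could skip on (the helper name is illustrative, not part of the suite):

    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv (float8 e4m3) lowers only on compute capability >= 8.9
        # (Ada/Hopper); an A10G reports (8, 6), matching the error above.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)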
2025-05-07T20:33:43.6918062Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False) -> CompilationError compiling _fbgemm_silu_mul_quant (moe/activation_test.py:117, fn -> silu_mul_quant, activation.py:80): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
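For reference, the computation under test is compact: fn() calls the fused silu_mul_quant op, while ref_fn() recomputes the activation in fp32 and then quantizes it rowwise. A plain-PyTorch restatement of the activation half, taken directly from the test body above:

    import torch

    def silu_mul_reference(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        # y = silu(x0) * x1 = x0 * sigmoid(x0) * x1, computed in fp32 as in ref_fn()
        x0_fp32 = x0.to(torch.float32)
        x1_fp32 = x1.to(torch.float32)
        return x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32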
2025-05-07T20:33:43.6950430Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=True) -> fn() completed; CompilationError compiling _kernel_quantize_fp8_row (moe/activation_test.py:126, ref_fn -> triton_quantize_fp8_row, fp8_gemm.py:2370): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
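When compiled=True, fn() gets through and the failure moves one step later, into triton_quantize_fp8_row, whose _kernel_quantize_fp8_row kernel also materializes an e4m3 tensor. The rowwise scheme itself is simple: one scale per row, optionally clamped by scale_ub, then divide and cast. A hedged pure-PyTorch sketch of that idea (448 is the float8_e4m3fn finite max; this illustrates the scheme, not the FBGEMM kernel):

    import torch
    from typing import Optional, Tuple

    E4M3_MAX = 448.0  # largest finite float8_e4m3fn value

    def quantize_fp8_row_sketch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # one scale per row, from the row's max magnitude
        row_max = y.abs().amax(dim=-1).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.clamp(row_max, max=scale_ub.item())
        scale = torch.clamp(row_max, min=1e-12) / E4M3_MAX
        y_fp8 = (y.to(torch.float32) / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

The test's check then dequantizes with y_fp8.to(torch.float32) * y_scale[:, None], the inverse of the division above.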
2025-05-07T20:33:43.7562452Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=False) -> CompilationError compiling _fbgemm_silu_mul_quant (moe/activation_test.py:117, fn -> silu_mul_quant, activation.py:80): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:43.9593462Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> CompilationError compiling _fbgemm_silu_mul_quant (moe/activation_test.py:117, fn -> silu_mul_quant, activation.py:80): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
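Because each Triton kernel is JIT-compiled on first call, every Hypothesis example re-raises the same CompilationError instead of the suite failing once. One hedged way to fail fast on unsupported hardware (illustrative; not the suite's actual gating) is to reuse the capability check sketched earlier as a skip condition:

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:  # as sketched above
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv needs sm_89+; only fp8e4b15/fp8e5 here")
    class GatedActivationTests(unittest.TestCase):
        pass  # test_silu_mul_quant would live here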
2025-05-07T20:33:43.9624921Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=True) -> fn() completed; CompilationError compiling _kernel_quantize_fp8_row (moe/activation_test.py:126, ref_fn -> triton_quantize_fp8_row, fp8_gemm.py:2370): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:44.3442968Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True) -> fn() completed; CompilationError compiling _kernel_quantize_fp8_row (moe/activation_test.py:126, ref_fn -> triton_quantize_fp8_row, fp8_gemm.py:2370): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:44.7110006Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=True) -> fn() completed; CompilationError compiling _kernel_quantize_fp8_row (moe/activation_test.py:126, ref_fn -> triton_quantize_fp8_row, fp8_gemm.py:2370): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
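The contiguous flag changes more than layout: x0 = x[:, :D] is a view whose row stride is 2*D, while x0.contiguous() copies it down to row stride D. torch.compile guards on strides, so alternating between the two forms forces recompiles; that is what the recompile_limit warning below means by "stride mismatch at index 0. expected 5120, actual 10240" (D vs 2*D for D=5120). A tiny demonstration:

    import torch

    T, D = 4, 5120
    x = torch.randn(T, 2 * D)
    x0_view = x[:, :D]              # view into x: stride (10240, 1)
    x0_copy = x0_view.contiguous()  # compact copy: stride (5120, 1)
    print(x0_view.stride(), x0_copy.stride())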
2025-05-07T20:33:45.1397329Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True) -> fn() completed; CompilationError compiling _kernel_quantize_fp8_row (moe/activation_test.py:126, ref_fn -> triton_quantize_fp8_row, fp8_gemm.py:2370): ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:45.5717019Z 2025-05-07T20:33:45.5717452Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:45.5718053Z 2025-05-07T20:33:45.5718159Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:45.5718590Z self=, 2025-05-07T20:33:45.5719008Z T=16384, 2025-05-07T20:33:45.5719204Z D=5120, 2025-05-07T20:33:45.5719457Z scale_ub=None, 2025-05-07T20:33:45.5719675Z contiguous=True, 2025-05-07T20:33:45.5719905Z compiled=True, 2025-05-07T20:33:45.5720120Z ) 2025-05-07T20:33:45.5979646Z W0507 20:33:45.595000 96958 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8) 2025-05-07T20:33:45.5982826Z W0507 20:33:45.595000 96958 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55) 2025-05-07T20:33:45.5985632Z W0507 20:33:45.595000 96958 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240 2025-05-07T20:33:45.5987711Z W0507 20:33:45.595000 96958 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles". 2025-05-07T20:33:45.5989327Z W0507 20:33:45.595000 96958 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html. 2025-05-07T20:33:45.6863015Z self = 2025-05-07T20:33:45.6863811Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:45.6864214Z 2025-05-07T20:33:45.6864327Z @given( 2025-05-07T20:33:45.6864659Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:45.6865023Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:45.6865357Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:45.6865718Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:45.6866074Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:45.6866376Z ) 2025-05-07T20:33:45.6866748Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:45.6867224Z def test_silu_mul_quant( 2025-05-07T20:33:45.6867484Z self, 2025-05-07T20:33:45.6867698Z T: int, 2025-05-07T20:33:45.6867917Z D: int, 2025-05-07T20:33:45.6868153Z scale_ub: Optional[float], 2025-05-07T20:33:45.6868442Z contiguous: bool, 2025-05-07T20:33:45.6868699Z compiled: bool, 2025-05-07T20:33:45.6868947Z ) -> None: 2025-05-07T20:33:45.6869174Z torch.manual_seed(2025) 2025-05-07T20:33:45.6869418Z 2025-05-07T20:33:45.6869704Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:45.6870066Z 2025-05-07T20:33:45.6870263Z x_sign = torch.sign(x) 2025-05-07T20:33:45.6870566Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:45.6870896Z x = x_sign * x_clamp 2025-05-07T20:33:45.6871139Z x0 = x[:, :D] 2025-05-07T20:33:45.6871362Z x1 = x[:, D:] 2025-05-07T20:33:45.6871577Z 2025-05-07T20:33:45.6871766Z if contiguous: 2025-05-07T20:33:45.6872011Z x0 = x0.contiguous() 2025-05-07T20:33:45.6872283Z x1 = x1.contiguous() 2025-05-07T20:33:45.6872532Z 2025-05-07T20:33:45.6872733Z if scale_ub is not None: 2025-05-07T20:33:45.6873021Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:45.6873365Z [scale_ub], device="cuda", dtype=torch.float32 
2025-05-07T20:33:45.6873689Z ) 2025-05-07T20:33:45.6873896Z else: 2025-05-07T20:33:45.6874117Z scale_ub_tensor = None 2025-05-07T20:33:45.6874375Z 2025-05-07T20:33:45.6874617Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:45.6874948Z op = silu_mul_quant 2025-05-07T20:33:45.6875290Z if compiled: 2025-05-07T20:33:45.6875548Z op = torch.compile(op) 2025-05-07T20:33:45.6875860Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:45.6876152Z 2025-05-07T20:33:45.6876355Z y_fp8, y_scale = fn() 2025-05-07T20:33:45.6876746Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:45.6877048Z 2025-05-07T20:33:45.6877302Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:45.6877654Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:45.6877959Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:45.6878354Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:45.6878730Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:45.6879055Z 2025-05-07T20:33:45.6879258Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:45.6879468Z 2025-05-07T20:33:45.6879572Z moe/activation_test.py:126: 2025-05-07T20:33:45.6879884Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:45.6880232Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:45.6880572Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:45.6881406Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:45.6882206Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:45.6882820Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:45.6883542Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:45.6884272Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:45.6885031Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:45.6885817Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:45.6886507Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:45.6887149Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:45.6887704Z fn() 2025-05-07T20:33:45.6888240Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:45.6888862Z self.fn.run( 2025-05-07T20:33:45.6889361Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:45.6889919Z kernel = self.compile( 2025-05-07T20:33:45.6890489Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:45.6891178Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:45.6891595Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:45.6891838Z 2025-05-07T20:33:45.6892051Z self = 2025-05-07T20:33:45.6893191Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, 
reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:45.6894803Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1df41580>} 2025-05-07T20:33:45.6896227Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:45.6897311Z context = 2025-05-07T20:33:45.6897672Z 2025-05-07T20:33:45.6897849Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:45.6898434Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:45.6898997Z module_map=module_map) 2025-05-07T20:33:45.6899379Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:45.6899758Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:45.6900050Z E ^ 2025-05-07T20:33:45.6900536Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:45.6901060Z 2025-05-07T20:33:45.6901499Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:45.6902051Z 2025-05-07T20:33:45.6902161Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:45.6902594Z self=, 2025-05-07T20:33:45.6903020Z T=1, 2025-05-07T20:33:45.6903221Z D=5120, 2025-05-07T20:33:45.6903424Z scale_ub=1200.0, 2025-05-07T20:33:45.6903646Z contiguous=True, 2025-05-07T20:33:45.6903872Z compiled=True, 2025-05-07T20:33:45.6904081Z ) 2025-05-07T20:33:45.8351592Z self = 2025-05-07T20:33:45.8353026Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:45.8353588Z 2025-05-07T20:33:45.8353753Z @given( 2025-05-07T20:33:45.8354213Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:45.8354852Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:45.8355461Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:45.8356130Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:45.8356795Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:45.8357367Z ) 2025-05-07T20:33:45.8358074Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:45.8358829Z def test_silu_mul_quant( 2025-05-07T20:33:45.8359091Z self, 2025-05-07T20:33:45.8359289Z T: int, 2025-05-07T20:33:45.8359494Z D: int, 2025-05-07T20:33:45.8359721Z scale_ub: Optional[float], 2025-05-07T20:33:45.8359992Z contiguous: bool, 2025-05-07T20:33:45.8360238Z compiled: bool, 2025-05-07T20:33:45.8360467Z ) -> None: 2025-05-07T20:33:45.8360682Z torch.manual_seed(2025) 2025-05-07T20:33:45.8360930Z 2025-05-07T20:33:45.8361215Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:45.8361564Z 2025-05-07T20:33:45.8361763Z x_sign = torch.sign(x) 2025-05-07T20:33:45.8362061Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:45.8362375Z x = x_sign * x_clamp 2025-05-07T20:33:45.8362619Z x0 = x[:, :D] 2025-05-07T20:33:45.8362842Z x1 = x[:, D:] 2025-05-07T20:33:45.8363045Z 2025-05-07T20:33:45.8363240Z if contiguous: 2025-05-07T20:33:45.8363483Z x0 = x0.contiguous() 2025-05-07T20:33:45.8363741Z x1 = x1.contiguous() 2025-05-07T20:33:45.8363989Z 2025-05-07T20:33:45.8364188Z if scale_ub is not None: 2025-05-07T20:33:45.8364462Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:45.8364804Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:45.8365123Z ) 2025-05-07T20:33:45.8365329Z else: 2025-05-07T20:33:45.8365541Z scale_ub_tensor = 
None 2025-05-07T20:33:45.8365800Z 2025-05-07T20:33:45.8366034Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:45.8366352Z op = silu_mul_quant 2025-05-07T20:33:45.8366607Z if compiled: 2025-05-07T20:33:45.8366860Z op = torch.compile(op) 2025-05-07T20:33:45.8367162Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:45.8367518Z 2025-05-07T20:33:45.8367716Z > y_fp8, y_scale = fn() 2025-05-07T20:33:45.8367882Z 2025-05-07T20:33:45.8367983Z moe/activation_test.py:117: 2025-05-07T20:33:45.8368289Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:45.8368693Z moe/activation_test.py:115: in fn 2025-05-07T20:33:45.8368986Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:45.8369569Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:45.8370231Z return fn(*args, **kwargs) 2025-05-07T20:33:45.8370930Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:45.8371661Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:45.8372237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:45.8372970Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:45.8373678Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:45.8374239Z kernel = self.compile( 2025-05-07T20:33:45.8374892Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:45.8375645Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:45.8376062Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:45.8376317Z 2025-05-07T20:33:45.8376533Z self = 2025-05-07T20:33:45.8377669Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:45.8379165Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1da20680>} 2025-05-07T20:33:45.8380593Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:45.8381687Z context = 2025-05-07T20:33:45.8382001Z 2025-05-07T20:33:45.8382174Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:45.8382722Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:45.8383216Z module_map=module_map) 2025-05-07T20:33:45.8383592Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:45.8383968Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:45.8384242Z E ^ 2025-05-07T20:33:45.8384725Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:45.8385207Z 2025-05-07T20:33:45.8385654Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:45.8386209Z 2025-05-07T20:33:45.8386318Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:45.8386756Z self=, 2025-05-07T20:33:45.8387180Z T=1, 2025-05-07T20:33:45.8387380Z D=5120, 2025-05-07T20:33:45.8387586Z scale_ub=None, 2025-05-07T20:33:45.8387807Z contiguous=False, 2025-05-07T20:33:45.8388040Z compiled=True, 2025-05-07T20:33:45.8388251Z ) 2025-05-07T20:33:46.0788171Z self = 2025-05-07T20:33:46.0788825Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:46.0789417Z 2025-05-07T20:33:46.0789537Z @given( 2025-05-07T20:33:46.0789850Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:46.0790192Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:46.0790580Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:46.0790916Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:46.0791257Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:46.0791552Z ) 2025-05-07T20:33:46.0791903Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:46.0792424Z def test_silu_mul_quant( 2025-05-07T20:33:46.0792674Z self, 2025-05-07T20:33:46.0792864Z T: int, 2025-05-07T20:33:46.0793063Z D: int, 2025-05-07T20:33:46.0793280Z scale_ub: Optional[float], 2025-05-07T20:33:46.0793554Z contiguous: bool, 2025-05-07T20:33:46.0793796Z compiled: bool, 2025-05-07T20:33:46.0794023Z ) -> None: 2025-05-07T20:33:46.0794242Z torch.manual_seed(2025) 2025-05-07T20:33:46.0794485Z 2025-05-07T20:33:46.0794763Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:46.0795123Z 2025-05-07T20:33:46.0795313Z x_sign = torch.sign(x) 2025-05-07T20:33:46.0795610Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:46.0795928Z x = x_sign * x_clamp 2025-05-07T20:33:46.0803636Z x0 = x[:, :D] 2025-05-07T20:33:46.0803890Z x1 = x[:, D:] 2025-05-07T20:33:46.0804118Z 2025-05-07T20:33:46.0804319Z if contiguous: 2025-05-07T20:33:46.0804569Z x0 = x0.contiguous() 2025-05-07T20:33:46.0804834Z x1 = x1.contiguous() 2025-05-07T20:33:46.0805085Z 2025-05-07T20:33:46.0805281Z if scale_ub is not None: 2025-05-07T20:33:46.0805557Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:46.0805904Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:46.0806233Z ) 2025-05-07T20:33:46.0806422Z else: 2025-05-07T20:33:46.0806638Z scale_ub_tensor = None 2025-05-07T20:33:46.0806903Z 2025-05-07T20:33:46.0807137Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:46.0807471Z op = silu_mul_quant 2025-05-07T20:33:46.0807733Z if compiled: 2025-05-07T20:33:46.0807985Z op = torch.compile(op) 2025-05-07T20:33:46.0808300Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:46.0808584Z 2025-05-07T20:33:46.0808776Z y_fp8, y_scale = fn() 2025-05-07T20:33:46.0809081Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:46.0809392Z 2025-05-07T20:33:46.0809642Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:46.0809981Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:46.0810284Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:46.0810613Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:46.0810976Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:46.0811304Z 2025-05-07T20:33:46.0811511Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:46.0811716Z 2025-05-07T20:33:46.0811823Z moe/activation_test.py:126: 2025-05-07T20:33:46.0812158Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:46.0812510Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:46.0812854Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:46.0813680Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:46.0814479Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:46.0815223Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:46.0816008Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:46.0816746Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:46.0817555Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:46.0818334Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:46.0819020Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:46.0819694Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:46.0820248Z fn() 2025-05-07T20:33:46.0820789Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:46.0821409Z self.fn.run( 2025-05-07T20:33:46.0821909Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:46.0822480Z kernel = self.compile( 2025-05-07T20:33:46.0823058Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:46.0823755Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:46.0824182Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:46.0824425Z 2025-05-07T20:33:46.0824692Z self = 2025-05-07T20:33:46.0826179Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:46.0827691Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1da22b60>} 2025-05-07T20:33:46.0829112Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:46.0830214Z context = 2025-05-07T20:33:46.0830528Z 2025-05-07T20:33:46.0830709Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:46.0831264Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:46.0831759Z module_map=module_map) 2025-05-07T20:33:46.0832144Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:46.0832517Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:46.0832791Z E ^ 2025-05-07T20:33:46.0833281Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:46.0833767Z 2025-05-07T20:33:46.0834206Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:46.0834754Z 2025-05-07T20:33:46.0834866Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:46.0835292Z self=, 2025-05-07T20:33:46.0835711Z T=1, 2025-05-07T20:33:46.0835911Z D=5120, 2025-05-07T20:33:46.0836107Z scale_ub=None, 2025-05-07T20:33:46.0836327Z contiguous=True, 2025-05-07T20:33:46.0836559Z compiled=False, 2025-05-07T20:33:46.0836764Z ) 2025-05-07T20:33:46.2336594Z self = 2025-05-07T20:33:46.2337305Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:46.2337678Z 2025-05-07T20:33:46.2337788Z @given( 2025-05-07T20:33:46.2338104Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:46.2338561Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:46.2338875Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:46.2339204Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:46.2339605Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:46.2339908Z ) 2025-05-07T20:33:46.2340264Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:46.2340724Z def test_silu_mul_quant( 2025-05-07T20:33:46.2340970Z self, 2025-05-07T20:33:46.2341221Z T: int, 2025-05-07T20:33:46.2341421Z D: int, 2025-05-07T20:33:46.2341639Z scale_ub: Optional[float], 2025-05-07T20:33:46.2341907Z contiguous: bool, 2025-05-07T20:33:46.2342152Z compiled: bool, 2025-05-07T20:33:46.2342375Z ) -> None: 2025-05-07T20:33:46.2342587Z torch.manual_seed(2025) 2025-05-07T20:33:46.2342833Z 2025-05-07T20:33:46.2343111Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:46.2343460Z 2025-05-07T20:33:46.2343646Z x_sign = torch.sign(x) 2025-05-07T20:33:46.2343936Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:46.2344250Z x = x_sign * x_clamp 2025-05-07T20:33:46.2344491Z x0 = x[:, :D] 2025-05-07T20:33:46.2344707Z x1 = x[:, D:] 2025-05-07T20:33:46.2344917Z 2025-05-07T20:33:46.2345095Z if contiguous: 2025-05-07T20:33:46.2345390Z x0 = x0.contiguous() 2025-05-07T20:33:46.2345653Z x1 = x1.contiguous() 2025-05-07T20:33:46.2345896Z 2025-05-07T20:33:46.2346090Z if scale_ub is not None: 2025-05-07T20:33:46.2346368Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:46.2346700Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:46.2347013Z ) 2025-05-07T20:33:46.2347210Z else: 2025-05-07T20:33:46.2347426Z scale_ub_tensor = None 2025-05-07T20:33:46.2347693Z 2025-05-07T20:33:46.2347927Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:46.2348246Z op = silu_mul_quant 2025-05-07T20:33:46.2348492Z if compiled: 2025-05-07T20:33:46.2348739Z op = torch.compile(op) 2025-05-07T20:33:46.2349038Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:46.2349311Z 2025-05-07T20:33:46.2349504Z > y_fp8, y_scale = fn() 2025-05-07T20:33:46.2349667Z 2025-05-07T20:33:46.2349775Z moe/activation_test.py:117: 2025-05-07T20:33:46.2350063Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:46.2350406Z moe/activation_test.py:115: in fn 2025-05-07T20:33:46.2350694Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:46.2351407Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:46.2352130Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:46.2352690Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:46.2353402Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:46.2354091Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:46.2354650Z kernel = self.compile( 2025-05-07T20:33:46.2355215Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:46.2355905Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:46.2356303Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:46.2356542Z 2025-05-07T20:33:46.2356748Z self = 2025-05-07T20:33:46.2357873Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:46.2359397Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1da239c0>} 2025-05-07T20:33:46.2360809Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:46.2361936Z context = 2025-05-07T20:33:46.2362243Z 2025-05-07T20:33:46.2362413Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:46.2362957Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:46.2363447Z module_map=module_map) 2025-05-07T20:33:46.2363822Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:46.2364193Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:46.2364458Z E ^ 2025-05-07T20:33:46.2364942Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:46.2365418Z 2025-05-07T20:33:46.2365899Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:46.2366448Z 2025-05-07T20:33:46.2366559Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:46.2366982Z self=, 2025-05-07T20:33:46.2367400Z T=128, 2025-05-07T20:33:46.2367594Z D=5120, 2025-05-07T20:33:46.2367789Z scale_ub=None, 2025-05-07T20:33:46.2368010Z contiguous=False, 2025-05-07T20:33:46.2368238Z compiled=True, 2025-05-07T20:33:46.2368448Z ) 2025-05-07T20:33:46.2368769Z self = 2025-05-07T20:33:46.2369285Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:46.2369569Z 2025-05-07T20:33:46.2369655Z @given( 2025-05-07T20:33:46.2369888Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:46.2370217Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:46.2370536Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:46.2370870Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:46.2371220Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:46.2371516Z ) 2025-05-07T20:33:46.2371874Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:46.2372334Z def test_silu_mul_quant( 2025-05-07T20:33:46.2372588Z self, 2025-05-07T20:33:46.2372788Z T: int, 2025-05-07T20:33:46.2372975Z D: int, 2025-05-07T20:33:46.2373197Z scale_ub: Optional[float], 2025-05-07T20:33:46.2373483Z contiguous: bool, 2025-05-07T20:33:46.2373724Z compiled: bool, 2025-05-07T20:33:46.2373950Z ) -> None: 2025-05-07T20:33:46.2374166Z torch.manual_seed(2025) 2025-05-07T20:33:46.2374402Z 2025-05-07T20:33:46.2374812Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:46.2375167Z 2025-05-07T20:33:46.2375354Z x_sign = torch.sign(x) 2025-05-07T20:33:46.2375648Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:46.2375967Z x = x_sign * x_clamp 2025-05-07T20:33:46.2376204Z x0 = x[:, :D] 2025-05-07T20:33:46.2376420Z x1 = x[:, D:] 2025-05-07T20:33:46.2376628Z 2025-05-07T20:33:46.2376806Z if contiguous: 2025-05-07T20:33:46.2377034Z x0 = x0.contiguous() 2025-05-07T20:33:46.2377293Z x1 = x1.contiguous() 2025-05-07T20:33:46.2377530Z 2025-05-07T20:33:46.2377776Z if scale_ub is not None: 2025-05-07T20:33:46.2378051Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:46.2378384Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:46.2378698Z ) 2025-05-07T20:33:46.2378884Z else: 2025-05-07T20:33:46.2379156Z scale_ub_tensor = None 2025-05-07T20:33:46.2379408Z 2025-05-07T20:33:46.2379638Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:46.2379961Z op = silu_mul_quant 2025-05-07T20:33:46.2380205Z if compiled: 2025-05-07T20:33:46.2380548Z op = torch.compile(op) 2025-05-07T20:33:46.2380846Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:46.2381120Z 2025-05-07T20:33:46.2381309Z > y_fp8, y_scale = fn() 2025-05-07T20:33:46.2381472Z 2025-05-07T20:33:46.2381572Z moe/activation_test.py:117: 2025-05-07T20:33:46.2381859Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:46.2382200Z moe/activation_test.py:115: in fn 2025-05-07T20:33:46.2382481Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:46.2383074Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:46.2383652Z return fn(*args, **kwargs) 
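Every failing example above reduces to the same root cause: both _kernel_quantize_fp8_row and _fbgemm_silu_mul_quant ask Triton for the fp8e4nv dtype (FP8 E4M3), and Triton only lowers that type on NVIDIA GPUs with compute capability 8.9 or newer (Ada/Hopper). The linux.g5.4xlarge runner carries an A10G at SM 8.6, where Triton exposes only fp8e4b15 and fp8e5, hence the repeated ValueError. A minimal capability gate for tests like this one could look as follows (a sketch only; supports_fp8e4nv is a hypothetical helper, not FBGEMM's actual code):

import unittest

import torch

def supports_fp8e4nv() -> bool:
    # get_device_capability() returns (8, 6) on A10G and (9, 0) on H100;
    # Triton's fp8e4nv lowering needs SM 8.9 or newer.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

# Hypothetical usage on the failing test:
# @unittest.skipUnless(supports_fp8e4nv(), "Triton fp8e4nv requires SM 8.9+")
# def test_silu_mul_quant(self, ...) -> None: ...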
2025-05-07T20:33:46.2384341Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:46.2385107Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:46.2385670Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:46.2386376Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:46.2387066Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:46.2387622Z kernel = self.compile( 2025-05-07T20:33:46.2388179Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:46.2388863Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:46.2389266Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:46.2389497Z 2025-05-07T20:33:46.2389714Z self = 2025-05-07T20:33:46.2390831Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:46.2392255Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1da20a40>} 2025-05-07T20:33:46.2393655Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:46.2394730Z context = 2025-05-07T20:33:46.2395029Z 2025-05-07T20:33:46.2395199Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:46.2395726Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:46.2396204Z module_map=module_map) 2025-05-07T20:33:46.2396570Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:46.2396921Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:46.2397181Z E ^ 2025-05-07T20:33:46.2397659Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:46.2398126Z 2025-05-07T20:33:46.2398562Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:46.2399152Z 2025-05-07T20:33:46.2399257Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:46.2399680Z self=, 2025-05-07T20:33:46.2400136Z T=128, 2025-05-07T20:33:46.2400317Z D=7168, 2025-05-07T20:33:46.2400514Z scale_ub=1200.0, 2025-05-07T20:33:46.2400738Z contiguous=False, 2025-05-07T20:33:46.2400969Z compiled=False, 2025-05-07T20:33:46.2401178Z ) 2025-05-07T20:33:46.3541545Z self = 2025-05-07T20:33:46.3542364Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:46.3542805Z 2025-05-07T20:33:46.3542914Z @given( 2025-05-07T20:33:46.3543232Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:46.3543648Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:46.3543969Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:46.3544296Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:46.3544627Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:46.3544915Z ) 2025-05-07T20:33:46.3545265Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:46.3545723Z def test_silu_mul_quant( 2025-05-07T20:33:46.3545962Z self, 2025-05-07T20:33:46.3546150Z T: int, 2025-05-07T20:33:46.3546461Z D: int, 2025-05-07T20:33:46.3546687Z scale_ub: Optional[float], 2025-05-07T20:33:46.3546968Z contiguous: bool, 2025-05-07T20:33:46.3547212Z compiled: bool, 2025-05-07T20:33:46.3547438Z ) -> None: 2025-05-07T20:33:46.3547646Z torch.manual_seed(2025) 2025-05-07T20:33:46.3547885Z 2025-05-07T20:33:46.3548157Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:46.3548513Z 2025-05-07T20:33:46.3548705Z x_sign = torch.sign(x) 2025-05-07T20:33:46.3549045Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:46.3549363Z x = x_sign * x_clamp 2025-05-07T20:33:46.3549597Z x0 = x[:, :D] 2025-05-07T20:33:46.3549812Z x1 = x[:, D:] 2025-05-07T20:33:46.3550015Z 2025-05-07T20:33:46.3550196Z if contiguous: 2025-05-07T20:33:46.3550426Z x0 = x0.contiguous() 2025-05-07T20:33:46.3550687Z x1 = x1.contiguous() 2025-05-07T20:33:46.3550927Z 2025-05-07T20:33:46.3551112Z if scale_ub is not None: 2025-05-07T20:33:46.3551388Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:46.3551719Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:46.3552030Z ) 2025-05-07T20:33:46.3552221Z else: 2025-05-07T20:33:46.3552424Z scale_ub_tensor = None 2025-05-07T20:33:46.3552681Z 2025-05-07T20:33:46.3552909Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:46.3553222Z op = silu_mul_quant 2025-05-07T20:33:46.3553474Z if compiled: 2025-05-07T20:33:46.3553721Z op = torch.compile(op) 2025-05-07T20:33:46.3554025Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:46.3554296Z 2025-05-07T20:33:46.3554493Z > y_fp8, y_scale = fn() 2025-05-07T20:33:46.3554657Z 2025-05-07T20:33:46.3554758Z moe/activation_test.py:117: 2025-05-07T20:33:46.3555052Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:46.3555410Z moe/activation_test.py:115: in fn 2025-05-07T20:33:46.3555693Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:46.3556415Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:46.3557133Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:46.3557693Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:46.3558485Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:46.3559235Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:46.3559789Z kernel = self.compile( 2025-05-07T20:33:46.3560415Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:46.3561113Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:46.3561583Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:46.3561820Z 2025-05-07T20:33:46.3562031Z self = 2025-05-07T20:33:46.3563155Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:46.3564585Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c08538a40>} 2025-05-07T20:33:46.3565994Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:46.3567113Z context = 2025-05-07T20:33:46.3567422Z 2025-05-07T20:33:46.3567591Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:46.3568134Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:46.3568619Z module_map=module_map) 2025-05-07T20:33:46.3568986Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:46.3569354Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:46.3569624Z E ^ 2025-05-07T20:33:46.3570103Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:46.3570580Z 2025-05-07T20:33:46.3571019Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:46.3571567Z 2025-05-07T20:33:46.3571675Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:46.3572106Z self=, 2025-05-07T20:33:46.3572526Z T=128, 2025-05-07T20:33:46.3572711Z D=5120, 2025-05-07T20:33:46.3572900Z scale_ub=None, 2025-05-07T20:33:46.3573110Z contiguous=False, 2025-05-07T20:33:46.3573335Z compiled=False, 2025-05-07T20:33:46.3573537Z ) 2025-05-07T20:33:46.3573851Z self = 2025-05-07T20:33:46.3574361Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:46.3574799Z 2025-05-07T20:33:46.3574883Z @given( 2025-05-07T20:33:46.3575110Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:46.3575421Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:46.3575733Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:46.3576131Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:46.3576500Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:46.3576785Z ) 2025-05-07T20:33:46.3577142Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:46.3577593Z def test_silu_mul_quant( 2025-05-07T20:33:46.3577842Z self, 2025-05-07T20:33:46.3578034Z T: int, 2025-05-07T20:33:46.3578222Z D: int, 2025-05-07T20:33:46.3578439Z scale_ub: Optional[float], 2025-05-07T20:33:46.3578709Z contiguous: bool, 2025-05-07T20:33:46.3579011Z compiled: bool, 2025-05-07T20:33:46.3579221Z ) -> None: 2025-05-07T20:33:46.3579436Z torch.manual_seed(2025) 2025-05-07T20:33:46.3579677Z 2025-05-07T20:33:46.3579945Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:46.3580296Z 2025-05-07T20:33:46.3580538Z x_sign = torch.sign(x) 2025-05-07T20:33:46.3580828Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:46.3581144Z x = x_sign * x_clamp 2025-05-07T20:33:46.3581380Z x0 = x[:, :D] 2025-05-07T20:33:46.3581627Z x1 = x[:, D:] 2025-05-07T20:33:46.3581829Z 2025-05-07T20:33:46.3582011Z if contiguous: 2025-05-07T20:33:46.3582234Z x0 = x0.contiguous() 2025-05-07T20:33:46.3582493Z x1 = x1.contiguous() 2025-05-07T20:33:46.3582736Z 2025-05-07T20:33:46.3582918Z if scale_ub is not None: 2025-05-07T20:33:46.3583187Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:46.3583525Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:46.3583837Z ) 2025-05-07T20:33:46.3584021Z else: 2025-05-07T20:33:46.3584236Z scale_ub_tensor = None 2025-05-07T20:33:46.3584486Z 2025-05-07T20:33:46.3584712Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:46.3585029Z op = silu_mul_quant 2025-05-07T20:33:46.3585276Z if compiled: 2025-05-07T20:33:46.3585515Z op = torch.compile(op) 2025-05-07T20:33:46.3585866Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:46.3586151Z 2025-05-07T20:33:46.3586337Z > y_fp8, y_scale = fn() 2025-05-07T20:33:46.3586508Z 2025-05-07T20:33:46.3586604Z moe/activation_test.py:117: 2025-05-07T20:33:46.3586896Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:46.3587224Z moe/activation_test.py:115: in fn 2025-05-07T20:33:46.3587504Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:46.3588218Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:46.3588942Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:46.3589498Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:46.3596670Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:46.3597412Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:46.3597986Z kernel = self.compile( 2025-05-07T20:33:46.3598556Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:46.3599241Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:46.3599665Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:46.3599916Z 2025-05-07T20:33:46.3600127Z self = 2025-05-07T20:33:46.3601253Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:46.3602684Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1da2c400>} 2025-05-07T20:33:46.3604094Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:46.3605183Z context = 2025-05-07T20:33:46.3605491Z 2025-05-07T20:33:46.3605991Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:46.3606530Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:46.3607004Z module_map=module_map) 2025-05-07T20:33:46.3607416Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:46.3607784Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:46.3608054Z E ^ 2025-05-07T20:33:46.3608567Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:46.3609122Z 2025-05-07T20:33:46.3609563Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:46.3610108Z 2025-05-07T20:33:46.3610218Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:46.3610646Z self=, 2025-05-07T20:33:46.3611078Z T=128, 2025-05-07T20:33:46.3611274Z D=5120, 2025-05-07T20:33:46.3611467Z scale_ub=1200.0, 2025-05-07T20:33:46.3611702Z contiguous=True, 2025-05-07T20:33:46.3611934Z compiled=False, 2025-05-07T20:33:46.3612147Z ) 2025-05-07T20:33:46.5344167Z self = 2025-05-07T20:33:46.5345013Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:46.5345449Z 2025-05-07T20:33:46.5345561Z @given( 2025-05-07T20:33:46.5346014Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:46.5346445Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:46.5346842Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:46.5347242Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:46.5347582Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:46.5347878Z ) 2025-05-07T20:33:46.5348236Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:46.5348696Z def test_silu_mul_quant( 2025-05-07T20:33:46.5348939Z self, 2025-05-07T20:33:46.5349138Z T: int, 2025-05-07T20:33:46.5349332Z D: int, 2025-05-07T20:33:46.5349545Z scale_ub: Optional[float], 2025-05-07T20:33:46.5349819Z contiguous: bool, 2025-05-07T20:33:46.5350055Z compiled: bool, 2025-05-07T20:33:46.5350290Z ) -> None: 2025-05-07T20:33:46.5350505Z torch.manual_seed(2025) 2025-05-07T20:33:46.5350760Z 2025-05-07T20:33:46.5351039Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:46.5351391Z 2025-05-07T20:33:46.5351584Z x_sign = torch.sign(x) 2025-05-07T20:33:46.5351884Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:46.5352197Z x = x_sign * x_clamp 2025-05-07T20:33:46.5352437Z x0 = x[:, :D] 2025-05-07T20:33:46.5352654Z x1 = x[:, D:] 2025-05-07T20:33:46.5352861Z 2025-05-07T20:33:46.5353051Z if contiguous: 2025-05-07T20:33:46.5353285Z x0 = x0.contiguous() 2025-05-07T20:33:46.5353541Z x1 = x1.contiguous() 2025-05-07T20:33:46.5353786Z 2025-05-07T20:33:46.5353979Z if scale_ub is not None: 2025-05-07T20:33:46.5354254Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:46.5354604Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:46.5354922Z ) 2025-05-07T20:33:46.5355108Z else: 2025-05-07T20:33:46.5355326Z scale_ub_tensor = None 2025-05-07T20:33:46.5355581Z 2025-05-07T20:33:46.5355816Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:46.5356137Z op = silu_mul_quant 2025-05-07T20:33:46.5356397Z if compiled: 2025-05-07T20:33:46.5356651Z op = torch.compile(op) 2025-05-07T20:33:46.5356950Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:46.5357226Z 2025-05-07T20:33:46.5357417Z > y_fp8, y_scale = fn() 2025-05-07T20:33:46.5357662Z 2025-05-07T20:33:46.5357762Z moe/activation_test.py:117: 2025-05-07T20:33:46.5358064Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:46.5358412Z moe/activation_test.py:115: in fn 2025-05-07T20:33:46.5358765Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:46.5359492Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:46.5360219Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:46.5360835Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:46.5361542Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:46.5362239Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:46.5362795Z kernel = self.compile( 2025-05-07T20:33:46.5363357Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:46.5364036Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:46.5364441Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:46.5364672Z 2025-05-07T20:33:46.5364889Z self = 2025-05-07T20:33:46.5366044Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:46.5367478Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1da2d300>} 2025-05-07T20:33:46.5368888Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:46.5370029Z context = 2025-05-07T20:33:46.5370329Z 2025-05-07T20:33:46.5370506Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:46.5371048Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:46.5371530Z module_map=module_map) 2025-05-07T20:33:46.5371910Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:46.5372272Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:46.5372540Z E ^ 2025-05-07T20:33:46.5373025Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:46.5373497Z 2025-05-07T20:33:46.5373942Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:46.5374483Z 2025-05-07T20:33:46.5374685Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:46.5375119Z self=, 2025-05-07T20:33:46.5375541Z T=1, 2025-05-07T20:33:46.5375736Z D=7168, 2025-05-07T20:33:46.5375936Z scale_ub=1200.0, 2025-05-07T20:33:46.5376164Z contiguous=True, 2025-05-07T20:33:46.5376395Z compiled=True, 2025-05-07T20:33:46.5376600Z ) 2025-05-07T20:33:46.5376930Z self = 2025-05-07T20:33:46.5377433Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:46.5377704Z 2025-05-07T20:33:46.5377787Z @given( 2025-05-07T20:33:46.5378025Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:46.5378353Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:46.5378748Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:46.5379103Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:46.5379445Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:46.5379749Z ) 2025-05-07T20:33:46.5380143Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:46.5380601Z def test_silu_mul_quant( 2025-05-07T20:33:46.5380848Z self, 2025-05-07T20:33:46.5381044Z T: int, 2025-05-07T20:33:46.5381236Z D: int, 2025-05-07T20:33:46.5381456Z scale_ub: Optional[float], 2025-05-07T20:33:46.5381765Z contiguous: bool, 2025-05-07T20:33:46.5382001Z compiled: bool, 2025-05-07T20:33:46.5382223Z ) -> None: 2025-05-07T20:33:46.5382428Z torch.manual_seed(2025) 2025-05-07T20:33:46.5382664Z 2025-05-07T20:33:46.5382939Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:46.5383286Z 2025-05-07T20:33:46.5383483Z x_sign = torch.sign(x) 2025-05-07T20:33:46.5383773Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:46.5384076Z x = x_sign * x_clamp 2025-05-07T20:33:46.5384320Z x0 = x[:, :D] 2025-05-07T20:33:46.5384529Z x1 = x[:, D:] 2025-05-07T20:33:46.5384729Z 2025-05-07T20:33:46.5384918Z if contiguous: 2025-05-07T20:33:46.5385150Z x0 = x0.contiguous() 2025-05-07T20:33:46.5385409Z x1 = x1.contiguous() 2025-05-07T20:33:46.5385690Z 2025-05-07T20:33:46.5385882Z if scale_ub is not None: 2025-05-07T20:33:46.5386158Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:46.5386491Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:46.5386805Z ) 2025-05-07T20:33:46.5386999Z else: 2025-05-07T20:33:46.5387210Z scale_ub_tensor = None 2025-05-07T20:33:46.5387469Z 2025-05-07T20:33:46.5387704Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:46.5388019Z op = silu_mul_quant 2025-05-07T20:33:46.5388270Z if compiled: 2025-05-07T20:33:46.5388521Z op = torch.compile(op) 2025-05-07T20:33:46.5388818Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:46.5389101Z 2025-05-07T20:33:46.5389293Z > y_fp8, y_scale = fn() 2025-05-07T20:33:46.5389457Z 2025-05-07T20:33:46.5389556Z moe/activation_test.py:117: 2025-05-07T20:33:46.5389850Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:46.5390194Z moe/activation_test.py:115: in fn 2025-05-07T20:33:46.5390481Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:46.5391053Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:46.5391637Z return fn(*args, **kwargs) 
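A secondary issue also surfaces in this run: the W0507 20:33:45.595 warnings above record torch._dynamo giving up on silu_mul_quant after hitting config.recompile_limit (8), because each Hypothesis example changes the strides of x0 (contiguous copies vs. views into the [T, 2*D] buffer) and every new stride set fails the previous guard. Two hedged ways to keep such a sweep compiling, as a sketch under those assumptions (the right fix for this suite may differ):

import torch
import torch._dynamo

# Option 1: raise the recompile ceiling so each example can get its own graph.
torch._dynamo.config.recompile_limit = 64

# Option 2: mark the varying batch dimension dynamic up front so shape
# changes across examples do not invalidate the guard set.
x0 = torch.randn(128, 5120, device="cuda", dtype=torch.bfloat16)
torch._dynamo.mark_dynamic(x0, 0)  # dim 0 (T) varies across examples

The warning is benign for correctness, since dynamo falls back to eager, but it hides compile coverage; TORCH_LOGS="recompiles" (suggested in the log itself) lists every guard failure.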
2025-05-07T20:33:46.5392324Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:46.5393048Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:46.5393600Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:46.5394314Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:46.5395008Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:46.5395563Z kernel = self.compile( 2025-05-07T20:33:46.5396124Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:46.5396813Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:46.5397216Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:46.5397449Z 2025-05-07T20:33:46.5397660Z self = 2025-05-07T20:33:46.5398832Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:46.5400295Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1da2eac0>} 2025-05-07T20:33:46.5401708Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:46.5402852Z context = 2025-05-07T20:33:46.5403155Z 2025-05-07T20:33:46.5403323Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:46.5403861Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:46.5404355Z module_map=module_map) 2025-05-07T20:33:46.5404731Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:46.5405096Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:46.5405361Z E ^ 2025-05-07T20:33:46.5405839Z E ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:33:46.5407344Z 
2025-05-07T20:33:46.5407447Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:46.6760837Z E       triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:46.6763211Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
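The failure is independent of the FBGEMM kernel body: any Triton program that casts to tl.float8e4nv raises the same CompilationError at make_ir time on this architecture. A standalone repro sketch (an assumption, not taken from the FBGEMM sources; the kernel name is hypothetical) follows:

    import torch
    import triton
    import triton.language as tl


    @triton.jit
    def _cast_to_fp8e4nv(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(x_ptr + offs, mask=mask)
        # On a pre-SM-8.9 GPU this cast is what trips
        # ValueError("type fp8e4nv not supported in this architecture. ...")
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)


    x = torch.randn(1024, device="cuda", dtype=torch.bfloat16)
    y = torch.empty_like(x, dtype=torch.float8_e4m3fn)
    _cast_to_fp8e4nv[(1,)](x, y, x.numel(), BLOCK=1024)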
2025-05-07T20:33:46.6763753Z 
2025-05-07T20:33:46.6763862Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:46.6764416Z     self=,
2025-05-07T20:33:46.6764852Z     T=1,
2025-05-07T20:33:46.6765037Z     D=7168,
2025-05-07T20:33:46.6765225Z     scale_ub=None,
2025-05-07T20:33:46.6765443Z     contiguous=False,
2025-05-07T20:33:46.6765666Z     compiled=True,
2025-05-07T20:33:46.6765913Z )
2025-05-07T20:33:46.7634963Z self = 
2025-05-07T20:33:46.7635675Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:33:46.7636053Z 
2025-05-07T20:33:46.7636170Z     @given(
2025-05-07T20:33:46.7636395Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:33:46.7636718Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:33:46.7637025Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:33:46.7637356Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:33:46.7637684Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:33:46.7637968Z     )
2025-05-07T20:33:46.7638322Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:33:46.7638773Z     def test_silu_mul_quant(
2025-05-07T20:33:46.7639119Z         self,
2025-05-07T20:33:46.7639313Z         T: int,
2025-05-07T20:33:46.7639501Z         D: int,
2025-05-07T20:33:46.7639715Z         scale_ub: Optional[float],
2025-05-07T20:33:46.7639982Z         contiguous: bool,
2025-05-07T20:33:46.7640211Z         compiled: bool,
2025-05-07T20:33:46.7640427Z     ) -> None:
2025-05-07T20:33:46.7640637Z         torch.manual_seed(2025)
2025-05-07T20:33:46.7640869Z 
2025-05-07T20:33:46.7641138Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:46.7641492Z 
2025-05-07T20:33:46.7641685Z         x_sign = torch.sign(x)
2025-05-07T20:33:46.7641968Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:33:46.7642283Z         x = x_sign * x_clamp
2025-05-07T20:33:46.7642526Z         x0 = x[:, :D]
2025-05-07T20:33:46.7642733Z         x1 = x[:, D:]
2025-05-07T20:33:46.7642933Z 
2025-05-07T20:33:46.7643113Z         if contiguous:
2025-05-07T20:33:46.7643338Z             x0 = x0.contiguous()
2025-05-07T20:33:46.7643599Z             x1 = x1.contiguous()
2025-05-07T20:33:46.7643842Z 
2025-05-07T20:33:46.7644028Z         if scale_ub is not None:
2025-05-07T20:33:46.7644295Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:33:46.7644629Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:33:46.7644936Z             )
2025-05-07T20:33:46.7645129Z         else:
2025-05-07T20:33:46.7645337Z             scale_ub_tensor = None
2025-05-07T20:33:46.7645584Z 
2025-05-07T20:33:46.7645815Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:46.7646128Z             op = silu_mul_quant
2025-05-07T20:33:46.7646379Z             if compiled:
2025-05-07T20:33:46.7646622Z                 op = torch.compile(op)
2025-05-07T20:33:46.7646921Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:46.7647194Z 
2025-05-07T20:33:46.7647376Z         y_fp8, y_scale = fn()
2025-05-07T20:33:46.7647660Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:33:46.7647957Z 
2025-05-07T20:33:46.7648187Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:46.7648536Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:33:46.7648877Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:33:46.7649185Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:33:46.7649547Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:46.7649861Z 
2025-05-07T20:33:46.7650053Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:33:46.7650329Z 
2025-05-07T20:33:46.7650427Z moe/activation_test.py:126: 
2025-05-07T20:33:46.7650727Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:46.7651069Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:33:46.7651453Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:46.7652278Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:33:46.7653069Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:33:46.7653699Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:46.7654416Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:46.7655246Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:33:46.7656004Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:33:46.7656764Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:33:46.7657435Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:33:46.7658069Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:33:46.7658653Z     fn()
2025-05-07T20:33:46.7659227Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:33:46.7659836Z     self.fn.run(
2025-05-07T20:33:46.7660317Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:46.7660865Z     kernel = self.compile(
2025-05-07T20:33:46.7661424Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:46.7662110Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:46.7662516Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:46.7662753Z 
2025-05-07T20:33:46.7662964Z self = 
2025-05-07T20:33:46.7664085Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:46.7665510Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1d968b80>}
2025-05-07T20:33:46.7666914Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:33:46.7667999Z context = 
2025-05-07T20:33:46.7668297Z 
2025-05-07T20:33:46.7668464Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:46.7669002Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:46.7669482Z                            module_map=module_map)
2025-05-07T20:33:46.7669847Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:46.7670204Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:33:46.7670471Z E       ^
2025-05-07T20:33:46.7670945Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:46.7671415Z 
2025-05-07T20:33:46.7671850Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
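In this example the error surfaces in the reference path instead: triton_quantize_fp8_row launches _kernel_quantize_fp8_row, which needs the same fp8e4nv lowering, so the reference implementation cannot compile on this GPU either. Any fallback for pre-SM-8.9 runners therefore has to avoid Triton on the reference side too. Below is a hedged sketch of a Triton-free row-wise FP8 reference quantizer in plain PyTorch; quantize_fp8_row_ref is a hypothetical helper whose scale convention is inferred from the test's dequantization (y_fp8.to(torch.float32) * y_scale[:, None]), not FBGEMM's triton_quantize_fp8_row API:

    from typing import Optional, Tuple

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3


    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row max magnitude, optionally clamped to the scale upper bound.
        row_max = y.abs().amax(dim=-1).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        row_max = torch.clamp(row_max, min=1e-12)  # guard against all-zero rows
        scale = row_max / FP8_MAX  # dequantization scale, one per row
        y_scaled = torch.clamp(y / scale[:, None], min=-FP8_MAX, max=FP8_MAX)
        return y_scaled.to(torch.float8_e4m3fn), scale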
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:46.7671415Z 2025-05-07T20:33:46.7671850Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:46.7672450Z 2025-05-07T20:33:46.7672554Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:46.7679494Z self=, 2025-05-07T20:33:46.7679927Z T=1, 2025-05-07T20:33:46.7680117Z D=5120, 2025-05-07T20:33:46.7680394Z scale_ub=1200.0, 2025-05-07T20:33:46.7680619Z contiguous=False, 2025-05-07T20:33:46.7680852Z compiled=True, 2025-05-07T20:33:46.7681059Z ) 2025-05-07T20:33:46.9226674Z self = 2025-05-07T20:33:46.9227665Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:46.9228068Z 2025-05-07T20:33:46.9228179Z @given( 2025-05-07T20:33:46.9228441Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:46.9228759Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:46.9229063Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:46.9229400Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:46.9229727Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:46.9230025Z ) 2025-05-07T20:33:46.9230377Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:46.9230832Z def test_silu_mul_quant( 2025-05-07T20:33:46.9231089Z self, 2025-05-07T20:33:46.9231281Z T: int, 2025-05-07T20:33:46.9231483Z D: int, 2025-05-07T20:33:46.9231712Z scale_ub: Optional[float], 2025-05-07T20:33:46.9232073Z contiguous: bool, 2025-05-07T20:33:46.9232331Z compiled: bool, 2025-05-07T20:33:46.9232554Z ) -> None: 2025-05-07T20:33:46.9232768Z torch.manual_seed(2025) 2025-05-07T20:33:46.9233007Z 2025-05-07T20:33:46.9233291Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:46.9233638Z 2025-05-07T20:33:46.9233825Z x_sign = torch.sign(x) 2025-05-07T20:33:46.9234116Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:46.9234428Z x = x_sign * x_clamp 2025-05-07T20:33:46.9234662Z x0 = x[:, :D] 2025-05-07T20:33:46.9234875Z x1 = x[:, D:] 2025-05-07T20:33:46.9235074Z 2025-05-07T20:33:46.9235257Z if contiguous: 2025-05-07T20:33:46.9235488Z x0 = x0.contiguous() 2025-05-07T20:33:46.9235748Z x1 = x1.contiguous() 2025-05-07T20:33:46.9235977Z 2025-05-07T20:33:46.9236169Z if scale_ub is not None: 2025-05-07T20:33:46.9236444Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:46.9236779Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:46.9237091Z ) 2025-05-07T20:33:46.9237280Z else: 2025-05-07T20:33:46.9237486Z scale_ub_tensor = None 2025-05-07T20:33:46.9237735Z 2025-05-07T20:33:46.9237968Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:46.9238276Z op = silu_mul_quant 2025-05-07T20:33:46.9238531Z if compiled: 2025-05-07T20:33:46.9238780Z op = torch.compile(op) 2025-05-07T20:33:46.9239068Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:46.9239346Z 2025-05-07T20:33:46.9239540Z > y_fp8, y_scale = fn() 2025-05-07T20:33:46.9239703Z 2025-05-07T20:33:46.9239815Z moe/activation_test.py:117: 2025-05-07T20:33:46.9240105Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:46.9240443Z moe/activation_test.py:115: in fn 2025-05-07T20:33:46.9240728Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:46.9241303Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:46.9241889Z return fn(*args, **kwargs) 
2025-05-07T20:33:46.9242567Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:46.9243286Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:46.9243903Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:46.9244610Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:46.9245362Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:46.9245922Z kernel = self.compile( 2025-05-07T20:33:46.9246492Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:46.9247220Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:46.9247633Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:46.9247872Z 2025-05-07T20:33:46.9248083Z self = 2025-05-07T20:33:46.9249210Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:46.9250648Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1d969e40>} 2025-05-07T20:33:46.9252101Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:46.9253185Z context = 2025-05-07T20:33:46.9253499Z 2025-05-07T20:33:46.9253668Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:46.9254207Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:46.9254809Z module_map=module_map) 2025-05-07T20:33:46.9255183Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:46.9255544Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:46.9255813Z E ^ 2025-05-07T20:33:46.9256285Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:46.9256761Z 2025-05-07T20:33:46.9257196Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:46.9257739Z 2025-05-07T20:33:46.9257843Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:46.9258255Z self=, 2025-05-07T20:33:46.9258661Z T=1, 2025-05-07T20:33:46.9258848Z D=5120, 2025-05-07T20:33:46.9259043Z scale_ub=1200.0, 2025-05-07T20:33:46.9259259Z contiguous=False, 2025-05-07T20:33:46.9259485Z compiled=False, 2025-05-07T20:33:46.9259690Z ) 2025-05-07T20:33:46.9260003Z self = 2025-05-07T20:33:46.9260505Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:46.9260785Z 2025-05-07T20:33:46.9260860Z @given( 2025-05-07T20:33:46.9261087Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:46.9261399Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:46.9261705Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:46.9262034Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:46.9262363Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:46.9262649Z ) 2025-05-07T20:33:46.9262996Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:46.9263446Z def test_silu_mul_quant( 2025-05-07T20:33:46.9263672Z self, 2025-05-07T20:33:46.9263862Z T: int, 2025-05-07T20:33:46.9264061Z D: int, 2025-05-07T20:33:46.9264323Z scale_ub: Optional[float], 2025-05-07T20:33:46.9264600Z contiguous: bool, 2025-05-07T20:33:46.9264835Z compiled: bool, 2025-05-07T20:33:46.9265044Z ) -> None: 2025-05-07T20:33:46.9265258Z torch.manual_seed(2025) 2025-05-07T20:33:46.9265494Z 2025-05-07T20:33:46.9265802Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:46.9266157Z 2025-05-07T20:33:46.9266352Z x_sign = torch.sign(x) 2025-05-07T20:33:46.9266635Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:46.9266988Z x = x_sign * x_clamp 2025-05-07T20:33:46.9267219Z x0 = x[:, :D] 2025-05-07T20:33:46.9267424Z x1 = x[:, D:] 2025-05-07T20:33:46.9267623Z 2025-05-07T20:33:46.9267801Z if contiguous: 2025-05-07T20:33:46.9268023Z x0 = x0.contiguous() 2025-05-07T20:33:46.9268273Z x1 = x1.contiguous() 2025-05-07T20:33:46.9268521Z 2025-05-07T20:33:46.9268702Z if scale_ub is not None: 2025-05-07T20:33:46.9268972Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:46.9269299Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:46.9269613Z ) 2025-05-07T20:33:46.9269797Z else: 2025-05-07T20:33:46.9270005Z scale_ub_tensor = None 2025-05-07T20:33:46.9270253Z 2025-05-07T20:33:46.9270478Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:46.9270798Z op = silu_mul_quant 2025-05-07T20:33:46.9271095Z if compiled: 2025-05-07T20:33:46.9271336Z op = torch.compile(op) 2025-05-07T20:33:46.9271643Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:46.9271925Z 2025-05-07T20:33:46.9272111Z > y_fp8, y_scale = fn() 2025-05-07T20:33:46.9272282Z 2025-05-07T20:33:46.9272378Z moe/activation_test.py:117: 2025-05-07T20:33:46.9272679Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:46.9273023Z moe/activation_test.py:115: in fn 2025-05-07T20:33:46.9273303Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:46.9274013Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:46.9274736Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:46.9275286Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:46.9276003Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:46.9276696Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:46.9277253Z kernel = self.compile( 2025-05-07T20:33:46.9277806Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:46.9278488Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:46.9278946Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:46.9279180Z 2025-05-07T20:33:46.9279390Z self = 2025-05-07T20:33:46.9280506Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:46.9281926Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1d96aac0>} 2025-05-07T20:33:46.9283330Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:46.9284413Z context = 2025-05-07T20:33:46.9284759Z 2025-05-07T20:33:46.9284924Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:46.9285453Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:46.9285974Z module_map=module_map) 2025-05-07T20:33:46.9286344Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:46.9286699Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:46.9286955Z E ^ 2025-05-07T20:33:46.9287474Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:46.9287944Z 2025-05-07T20:33:46.9288379Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:46.9288977Z 2025-05-07T20:33:46.9289077Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:46.9289504Z self=, 2025-05-07T20:33:46.9289917Z T=16384, 2025-05-07T20:33:46.9290102Z D=5120, 2025-05-07T20:33:46.9290299Z scale_ub=1200.0, 2025-05-07T20:33:46.9290523Z contiguous=False, 2025-05-07T20:33:46.9290745Z compiled=True, 2025-05-07T20:33:46.9290948Z ) 2025-05-07T20:33:47.0168887Z self = 2025-05-07T20:33:47.0169841Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:47.0170268Z 2025-05-07T20:33:47.0170394Z @given( 2025-05-07T20:33:47.0170706Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:47.0171147Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:47.0171514Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:47.0171849Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:47.0172189Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:47.0172493Z ) 2025-05-07T20:33:47.0172845Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:47.0173297Z def test_silu_mul_quant( 2025-05-07T20:33:47.0173536Z self, 2025-05-07T20:33:47.0173728Z T: int, 2025-05-07T20:33:47.0173914Z D: int, 2025-05-07T20:33:47.0174132Z scale_ub: Optional[float], 2025-05-07T20:33:47.0174410Z contiguous: bool, 2025-05-07T20:33:47.0174744Z compiled: bool, 2025-05-07T20:33:47.0174971Z ) -> None: 2025-05-07T20:33:47.0175182Z torch.manual_seed(2025) 2025-05-07T20:33:47.0175420Z 2025-05-07T20:33:47.0175689Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:47.0176031Z 2025-05-07T20:33:47.0176216Z x_sign = torch.sign(x) 2025-05-07T20:33:47.0176505Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:47.0176821Z x = x_sign * x_clamp 2025-05-07T20:33:47.0177056Z x0 = x[:, :D] 2025-05-07T20:33:47.0177264Z x1 = x[:, D:] 2025-05-07T20:33:47.0177466Z 2025-05-07T20:33:47.0177651Z if contiguous: 2025-05-07T20:33:47.0177875Z x0 = x0.contiguous() 2025-05-07T20:33:47.0178131Z x1 = x1.contiguous() 2025-05-07T20:33:47.0178371Z 2025-05-07T20:33:47.0178557Z if scale_ub is not None: 2025-05-07T20:33:47.0178829Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:47.0179164Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:47.0179470Z ) 2025-05-07T20:33:47.0179656Z else: 2025-05-07T20:33:47.0179864Z scale_ub_tensor = None 2025-05-07T20:33:47.0180111Z 2025-05-07T20:33:47.0180338Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:47.0180650Z op = silu_mul_quant 2025-05-07T20:33:47.0180894Z if compiled: 2025-05-07T20:33:47.0181137Z op = torch.compile(op) 2025-05-07T20:33:47.0181435Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:47.0181802Z 2025-05-07T20:33:47.0181985Z > y_fp8, y_scale = fn() 2025-05-07T20:33:47.0182154Z 2025-05-07T20:33:47.0182249Z moe/activation_test.py:117: 2025-05-07T20:33:47.0182544Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:47.0182933Z moe/activation_test.py:115: in fn 2025-05-07T20:33:47.0183216Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:47.0183796Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:47.0184438Z return fn(*args, **kwargs) 
2025-05-07T20:33:47.0185123Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:47.0185852Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:47.0186406Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:47.0187119Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:47.0187815Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:47.0188368Z kernel = self.compile( 2025-05-07T20:33:47.0188931Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:47.0189663Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:47.0190073Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:47.0190316Z 2025-05-07T20:33:47.0190532Z self = 2025-05-07T20:33:47.0191646Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:47.0193082Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1dc7c180>} 2025-05-07T20:33:47.0194499Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:47.0195587Z context = 2025-05-07T20:33:47.0195890Z 2025-05-07T20:33:47.0196064Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:47.0196599Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:47.0197088Z module_map=module_map) 2025-05-07T20:33:47.0197464Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:47.0197822Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:47.0198086Z E ^ 2025-05-07T20:33:47.0198565Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:47.0199037Z 2025-05-07T20:33:47.0199481Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:47.0200022Z 2025-05-07T20:33:47.0200127Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:47.0200650Z self=, 2025-05-07T20:33:47.0201080Z T=2048, 2025-05-07T20:33:47.0201258Z D=7168, 2025-05-07T20:33:47.0201468Z scale_ub=1200.0, 2025-05-07T20:33:47.0201690Z contiguous=False, 2025-05-07T20:33:47.0201918Z compiled=True, 2025-05-07T20:33:47.0202113Z ) 2025-05-07T20:33:47.0202429Z self = 2025-05-07T20:33:47.0202989Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:47.0203272Z 2025-05-07T20:33:47.0203354Z @given( 2025-05-07T20:33:47.0203572Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:47.0203886Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:47.0204301Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:47.0204628Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:47.0204957Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:47.0205248Z ) 2025-05-07T20:33:47.0205633Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:47.0206092Z def test_silu_mul_quant( 2025-05-07T20:33:47.0206329Z self, 2025-05-07T20:33:47.0206518Z T: int, 2025-05-07T20:33:47.0206707Z D: int, 2025-05-07T20:33:47.0206923Z scale_ub: Optional[float], 2025-05-07T20:33:47.0207195Z contiguous: bool, 2025-05-07T20:33:47.0207437Z compiled: bool, 2025-05-07T20:33:47.0207658Z ) -> None: 2025-05-07T20:33:47.0207869Z torch.manual_seed(2025) 2025-05-07T20:33:47.0208101Z 2025-05-07T20:33:47.0208375Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:47.0208727Z 2025-05-07T20:33:47.0208916Z x_sign = torch.sign(x) 2025-05-07T20:33:47.0209207Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:47.0209518Z x = x_sign * x_clamp 2025-05-07T20:33:47.0209798Z x0 = x[:, :D] 2025-05-07T20:33:47.0210016Z x1 = x[:, D:] 2025-05-07T20:33:47.0210229Z 2025-05-07T20:33:47.0210410Z if contiguous: 2025-05-07T20:33:47.0210643Z x0 = x0.contiguous() 2025-05-07T20:33:47.0210900Z x1 = x1.contiguous() 2025-05-07T20:33:47.0211137Z 2025-05-07T20:33:47.0211323Z if scale_ub is not None: 2025-05-07T20:33:47.0211598Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:47.0211929Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:47.0212226Z ) 2025-05-07T20:33:47.0212413Z else: 2025-05-07T20:33:47.0212620Z scale_ub_tensor = None 2025-05-07T20:33:47.0212863Z 2025-05-07T20:33:47.0213090Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:47.0213418Z op = silu_mul_quant 2025-05-07T20:33:47.0213657Z if compiled: 2025-05-07T20:33:47.0213899Z op = torch.compile(op) 2025-05-07T20:33:47.0214204Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:47.0214479Z 2025-05-07T20:33:47.0214797Z > y_fp8, y_scale = fn() 2025-05-07T20:33:47.0214957Z 2025-05-07T20:33:47.0215058Z moe/activation_test.py:117: 2025-05-07T20:33:47.0215346Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:47.0215681Z moe/activation_test.py:115: in fn 2025-05-07T20:33:47.0215966Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:47.0216544Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:47.0217126Z return fn(*args, **kwargs) 
2025-05-07T20:33:47.0217819Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:47.0218546Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:47.0219105Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:47.0219823Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:47.0220524Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:47.0221085Z kernel = self.compile( 2025-05-07T20:33:47.0221645Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:47.0222393Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:47.0222801Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:47.0223034Z 2025-05-07T20:33:47.0223251Z self = 2025-05-07T20:33:47.0224410Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:47.0226126Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1dc7cea0>} 2025-05-07T20:33:47.0227545Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:47.0228641Z context = 2025-05-07T20:33:47.0228949Z 2025-05-07T20:33:47.0229126Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:47.0229675Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:47.0230161Z module_map=module_map) 2025-05-07T20:33:47.0230627Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:47.0230993Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:47.0231271Z E ^ 2025-05-07T20:33:47.0231757Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:47.0232230Z 2025-05-07T20:33:47.0232673Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:47.0233217Z 2025-05-07T20:33:47.1387701Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:47.1388345Z self=, 2025-05-07T20:33:47.1389064Z T=1, 2025-05-07T20:33:47.1389335Z D=5120, 2025-05-07T20:33:47.1389616Z scale_ub=None, 2025-05-07T20:33:47.1389914Z contiguous=False, 2025-05-07T20:33:47.1390156Z compiled=False, 2025-05-07T20:33:47.1390365Z ) 2025-05-07T20:33:47.1390688Z self = 2025-05-07T20:33:47.1391198Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:47.1391477Z 2025-05-07T20:33:47.1391556Z @given( 2025-05-07T20:33:47.1391788Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:47.1392106Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:47.1392418Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:47.1392758Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:47.1393094Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:47.1393382Z ) 2025-05-07T20:33:47.1393730Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:47.1394183Z def test_silu_mul_quant( 2025-05-07T20:33:47.1394424Z self, 2025-05-07T20:33:47.1394626Z T: int, 2025-05-07T20:33:47.1394818Z D: int, 2025-05-07T20:33:47.1395038Z scale_ub: Optional[float], 2025-05-07T20:33:47.1395310Z contiguous: bool, 2025-05-07T20:33:47.1395547Z compiled: bool, 2025-05-07T20:33:47.1395774Z ) -> None: 2025-05-07T20:33:47.1395996Z torch.manual_seed(2025) 2025-05-07T20:33:47.1396253Z 2025-05-07T20:33:47.1396534Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:47.1396896Z 2025-05-07T20:33:47.1397100Z x_sign = torch.sign(x) 2025-05-07T20:33:47.1397396Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:47.1397833Z x = x_sign * x_clamp 2025-05-07T20:33:47.1398091Z x0 = x[:, :D] 2025-05-07T20:33:47.1398314Z x1 = x[:, D:] 2025-05-07T20:33:47.1398541Z 2025-05-07T20:33:47.1398744Z if contiguous: 2025-05-07T20:33:47.1399020Z x0 = x0.contiguous() 2025-05-07T20:33:47.1399288Z x1 = x1.contiguous() 2025-05-07T20:33:47.1399606Z 2025-05-07T20:33:47.1399797Z if scale_ub is not None: 2025-05-07T20:33:47.1400082Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:47.1400435Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:47.1400850Z ) 2025-05-07T20:33:47.1401058Z else: 2025-05-07T20:33:47.1401284Z scale_ub_tensor = None 2025-05-07T20:33:47.1401567Z 2025-05-07T20:33:47.1408656Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:47.1408996Z op = silu_mul_quant 2025-05-07T20:33:47.1409290Z if compiled: 2025-05-07T20:33:47.1409556Z op = torch.compile(op) 2025-05-07T20:33:47.1409863Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:47.1410141Z 2025-05-07T20:33:47.1410335Z > y_fp8, y_scale = fn() 2025-05-07T20:33:47.1410502Z 2025-05-07T20:33:47.1410606Z moe/activation_test.py:117: 2025-05-07T20:33:47.1410906Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:47.1411245Z moe/activation_test.py:115: in fn 2025-05-07T20:33:47.1411533Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:47.1412359Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:47.1413091Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:47.1413661Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:47.1414380Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:47.1415228Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:47.1415786Z kernel = self.compile( 2025-05-07T20:33:47.1416356Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:47.1417051Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:47.1417475Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:47.1417718Z 2025-05-07T20:33:47.1417937Z self = 2025-05-07T20:33:47.1419067Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:47.1420503Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1dc7de40>} 2025-05-07T20:33:47.1421925Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:47.1423010Z context = 2025-05-07T20:33:47.1423319Z 2025-05-07T20:33:47.1423495Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:47.1424048Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:47.1424538Z module_map=module_map) 2025-05-07T20:33:47.1424921Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:47.1425292Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:47.1425814Z E ^ 2025-05-07T20:33:47.1426454Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:47.1427014Z 2025-05-07T20:33:47.1427451Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:47.1428001Z 2025-05-07T20:33:47.1428189Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:47.1428611Z self=, 2025-05-07T20:33:47.1429027Z T=4096, 2025-05-07T20:33:47.1429219Z D=7168, 2025-05-07T20:33:47.1429476Z scale_ub=1200.0, 2025-05-07T20:33:47.1429698Z contiguous=False, 2025-05-07T20:33:47.1429927Z compiled=False, 2025-05-07T20:33:47.1430130Z ) 2025-05-07T20:33:47.1430446Z self = 2025-05-07T20:33:47.1430959Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:47.1431246Z 2025-05-07T20:33:47.1431328Z @given( 2025-05-07T20:33:47.1431554Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:47.1431877Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:47.1432186Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:47.1432517Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:47.1432849Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:47.1433143Z ) 2025-05-07T20:33:47.1433558Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:47.1434014Z def test_silu_mul_quant( 2025-05-07T20:33:47.1434257Z self, 2025-05-07T20:33:47.1434452Z T: int, 2025-05-07T20:33:47.1434644Z D: int, 2025-05-07T20:33:47.1434864Z scale_ub: Optional[float], 2025-05-07T20:33:47.1435136Z contiguous: bool, 2025-05-07T20:33:47.1435369Z compiled: bool, 2025-05-07T20:33:47.1435593Z ) -> None: 2025-05-07T20:33:47.1435804Z torch.manual_seed(2025) 2025-05-07T20:33:47.1436041Z 2025-05-07T20:33:47.1436316Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:47.1436669Z 2025-05-07T20:33:47.1436863Z x_sign = torch.sign(x) 2025-05-07T20:33:47.1437151Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:47.1437463Z x = x_sign * x_clamp 2025-05-07T20:33:47.1437704Z x0 = x[:, :D] 2025-05-07T20:33:47.1437907Z x1 = x[:, D:] 2025-05-07T20:33:47.1438111Z 2025-05-07T20:33:47.1438297Z if contiguous: 2025-05-07T20:33:47.1438524Z x0 = x0.contiguous() 2025-05-07T20:33:47.1438813Z x1 = x1.contiguous() 2025-05-07T20:33:47.1439060Z 2025-05-07T20:33:47.1439238Z if scale_ub is not None: 2025-05-07T20:33:47.1439508Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:47.1439841Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:47.1440148Z ) 2025-05-07T20:33:47.1440337Z else: 2025-05-07T20:33:47.1440540Z scale_ub_tensor = None 2025-05-07T20:33:47.1440783Z 2025-05-07T20:33:47.1441004Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:47.1441318Z op = silu_mul_quant 2025-05-07T20:33:47.1441561Z if compiled: 2025-05-07T20:33:47.1441799Z op = torch.compile(op) 2025-05-07T20:33:47.1442092Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:47.1442363Z 2025-05-07T20:33:47.1442547Z > y_fp8, y_scale = fn() 2025-05-07T20:33:47.1442714Z 2025-05-07T20:33:47.1442810Z moe/activation_test.py:117: 2025-05-07T20:33:47.1443103Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:47.1443425Z moe/activation_test.py:115: in fn 2025-05-07T20:33:47.1443702Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:47.1444410Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:47.1445185Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:47.1445735Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:47.1446449Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:47.1447182Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:47.1447735Z kernel = self.compile( 2025-05-07T20:33:47.1448300Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:47.1449024Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:47.1449433Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:47.1449670Z 2025-05-07T20:33:47.1449881Z self = 2025-05-07T20:33:47.1451006Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:47.1452429Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1dc7f380>} 2025-05-07T20:33:47.1455075Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:47.1456169Z context = 2025-05-07T20:33:47.1456467Z 2025-05-07T20:33:47.1456636Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:47.1457353Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:47.1457840Z module_map=module_map) 2025-05-07T20:33:47.1458213Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:47.1458576Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:47.1458843Z E ^ 2025-05-07T20:33:47.1459323Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:47.1459795Z 2025-05-07T20:33:47.1460236Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:47.1460784Z 2025-05-07T20:33:47.1460887Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:47.1461307Z self=, 2025-05-07T20:33:47.1461720Z T=16384, 2025-05-07T20:33:47.1461905Z D=7168, 2025-05-07T20:33:47.1462087Z scale_ub=None, 2025-05-07T20:33:47.1462293Z contiguous=True, 2025-05-07T20:33:47.1462510Z compiled=True, 2025-05-07T20:33:47.1462698Z ) 2025-05-07T20:33:47.3210892Z self = 2025-05-07T20:33:47.3211662Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:47.3212070Z 2025-05-07T20:33:47.3212192Z @given( 2025-05-07T20:33:47.3212509Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:47.3212916Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:47.3213240Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:47.3213586Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:47.3213916Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:47.3214211Z ) 2025-05-07T20:33:47.3214715Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:47.3215180Z def test_silu_mul_quant( 2025-05-07T20:33:47.3215434Z self, 2025-05-07T20:33:47.3215763Z T: int, 2025-05-07T20:33:47.3215957Z D: int, 2025-05-07T20:33:47.3216176Z scale_ub: Optional[float], 2025-05-07T20:33:47.3216457Z contiguous: bool, 2025-05-07T20:33:47.3216700Z compiled: bool, 2025-05-07T20:33:47.3216921Z ) -> None: 2025-05-07T20:33:47.3217203Z torch.manual_seed(2025) 2025-05-07T20:33:47.3217461Z 2025-05-07T20:33:47.3217735Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:47.3218090Z 2025-05-07T20:33:47.3218284Z x_sign = torch.sign(x) 2025-05-07T20:33:47.3218636Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:47.3218957Z x = x_sign * x_clamp 2025-05-07T20:33:47.3219201Z x0 = x[:, :D] 2025-05-07T20:33:47.3219418Z x1 = x[:, D:] 2025-05-07T20:33:47.3219629Z 2025-05-07T20:33:47.3219816Z if contiguous: 2025-05-07T20:33:47.3220044Z x0 = x0.contiguous() 2025-05-07T20:33:47.3220308Z x1 = x1.contiguous() 2025-05-07T20:33:47.3220561Z 2025-05-07T20:33:47.3220754Z if scale_ub is not None: 2025-05-07T20:33:47.3221026Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:47.3221363Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:47.3221680Z ) 2025-05-07T20:33:47.3221872Z else: 2025-05-07T20:33:47.3222095Z scale_ub_tensor = None 2025-05-07T20:33:47.3222354Z 2025-05-07T20:33:47.3222647Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:47.3222974Z op = silu_mul_quant 2025-05-07T20:33:47.3223237Z if compiled: 2025-05-07T20:33:47.3223487Z op = torch.compile(op) 2025-05-07T20:33:47.3223798Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:47.3224084Z 2025-05-07T20:33:47.3224285Z > y_fp8, y_scale = fn() 2025-05-07T20:33:47.3224452Z 2025-05-07T20:33:47.3224561Z moe/activation_test.py:117: 2025-05-07T20:33:47.3224871Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:47.3225216Z moe/activation_test.py:115: in fn 2025-05-07T20:33:47.3225760Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:47.3226350Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:47.3226945Z return fn(*args, **kwargs) 
2025-05-07T20:33:47.3227639Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:47.3228370Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:47.3228938Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:47.3229649Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:47.3230343Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:47.3230908Z kernel = self.compile( 2025-05-07T20:33:47.3231474Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:47.3232165Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:47.3232581Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:47.3232825Z 2025-05-07T20:33:47.3233043Z self = 2025-05-07T20:33:47.3234165Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:47.3235603Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1d6b44a0>} 2025-05-07T20:33:47.3237096Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:47.3238208Z context = 2025-05-07T20:33:47.3238599Z 2025-05-07T20:33:47.3238774Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:47.3239323Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:47.3239867Z module_map=module_map) 2025-05-07T20:33:47.3240243Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:47.3240608Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:47.3240874Z E ^ 2025-05-07T20:33:47.3241357Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:47.3241839Z 2025-05-07T20:33:47.3242278Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:47.3242822Z 2025-05-07T20:33:47.3242932Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:47.3243356Z self=, 2025-05-07T20:33:47.3243777Z T=4096, 2025-05-07T20:33:47.3243971Z D=5120, 2025-05-07T20:33:47.3244156Z scale_ub=None, 2025-05-07T20:33:47.3244437Z contiguous=False, 2025-05-07T20:33:47.3244665Z compiled=True, 2025-05-07T20:33:47.3244870Z ) 2025-05-07T20:33:47.3245185Z self = 2025-05-07T20:33:47.3245691Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:47.3245971Z 2025-05-07T20:33:47.3246056Z @given( 2025-05-07T20:33:47.3246280Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:47.3246597Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:47.3246908Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:47.3247238Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:47.3247570Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:47.3247859Z ) 2025-05-07T20:33:47.3248213Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:47.3248665Z def test_silu_mul_quant( 2025-05-07T20:33:47.3248907Z self, 2025-05-07T20:33:47.3249118Z T: int, 2025-05-07T20:33:47.3249352Z D: int, 2025-05-07T20:33:47.3249576Z scale_ub: Optional[float], 2025-05-07T20:33:47.3249849Z contiguous: bool, 2025-05-07T20:33:47.3250088Z compiled: bool, 2025-05-07T20:33:47.3250317Z ) -> None: 2025-05-07T20:33:47.3250541Z torch.manual_seed(2025) 2025-05-07T20:33:47.3250777Z 2025-05-07T20:33:47.3251055Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:47.3251413Z 2025-05-07T20:33:47.3251603Z x_sign = torch.sign(x) 2025-05-07T20:33:47.3251893Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:47.3252206Z x = x_sign * x_clamp 2025-05-07T20:33:47.3252440Z x0 = x[:, :D] 2025-05-07T20:33:47.3252657Z x1 = x[:, D:] 2025-05-07T20:33:47.3252864Z 2025-05-07T20:33:47.3253045Z if contiguous: 2025-05-07T20:33:47.3253279Z x0 = x0.contiguous() 2025-05-07T20:33:47.3253539Z x1 = x1.contiguous() 2025-05-07T20:33:47.3253785Z 2025-05-07T20:33:47.3253973Z if scale_ub is not None: 2025-05-07T20:33:47.3254251Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:47.3254709Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:47.3255016Z ) 2025-05-07T20:33:47.3255207Z else: 2025-05-07T20:33:47.3255414Z scale_ub_tensor = None 2025-05-07T20:33:47.3255657Z 2025-05-07T20:33:47.3255941Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:47.3256267Z op = silu_mul_quant 2025-05-07T20:33:47.3256519Z if compiled: 2025-05-07T20:33:47.3256763Z op = torch.compile(op) 2025-05-07T20:33:47.3257062Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:47.3257372Z 2025-05-07T20:33:47.3257570Z > y_fp8, y_scale = fn() 2025-05-07T20:33:47.3257732Z 2025-05-07T20:33:47.3257833Z moe/activation_test.py:117: 2025-05-07T20:33:47.3258131Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:47.3258506Z moe/activation_test.py:115: in fn 2025-05-07T20:33:47.3258792Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:47.3259366Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:47.3259946Z return fn(*args, **kwargs) 
2025-05-07T20:33:47.3260631Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:47.3261354Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:47.3261909Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:47.3262621Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:47.3263358Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:47.3263922Z kernel = self.compile( 2025-05-07T20:33:47.3264490Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:47.3265183Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:47.3265597Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:47.3265838Z 2025-05-07T20:33:47.3266053Z self = 2025-05-07T20:33:47.3267188Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:47.3268619Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1d6b51c0>} 2025-05-07T20:33:47.3270081Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:47.3271168Z context = 2025-05-07T20:33:47.3271469Z 2025-05-07T20:33:47.3271644Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:47.3272180Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:47.3272665Z module_map=module_map) 2025-05-07T20:33:47.3273038Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:47.3273395Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:47.3273663Z E ^ 2025-05-07T20:33:47.3274144Z E ValueError("type fp8e4nv not supported in this architecture. 
Hypothesis then tried eleven more examples, and every one failed with the identical CompilationError raised from _fbgemm_silu_mul_quant[grid] at fbgemm_gpu/experimental/gen_ai/moe/activation.py:80 (the compiled=True runs additionally enter through torch/_dynamo/eval_frame.py:678 in _fn before reaching the kernel; the full traceback is as above):

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True)
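For context on what these examples exercise: from the test body above, silu_mul_quant consumes the two [T, D] halves of a [T, 2*D] bf16 activation and returns an FP8 tensor together with per-row scales. A rough eager-mode sketch of that contract, assuming rowwise absmax scaling into float8_e4m3fn (the PyTorch spelling of fp8e4nv) and scale_ub acting as an upper bound on the row maximum; this is our reading of the test, not FBGEMM's documented implementation:

    from typing import Optional, Tuple

    import torch

    def silu_mul_quant_ref(
        x0: torch.Tensor,                         # [T, D] bf16, SiLU branch
        x1: torch.Tensor,                         # [T, D] bf16, gate branch
        scale_ub: Optional[torch.Tensor] = None,  # [1] fp32 cap, like scale_ub_tensor
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Fused activation: y = silu(x0) * x1, computed in fp32 for stability.
        y = torch.nn.functional.silu(x0.float()) * x1.float()
        # Rowwise absmax, optionally capped by scale_ub.
        row_max = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        # Map each row into the representable fp8e4nv (float8_e4m3fn) range.
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        scale = row_max / fp8_max
        y_fp8 = (y / scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
        return y_fp8, scale.squeeze(-1)

In eager PyTorch the final cast is a plain elementwise conversion, which is why a reference like this can run on GPUs where the fused Triton kernel cannot: it is the kernel's fp8e4nv code path that the compiler rejects here.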
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:48.2872300Z 2025-05-07T20:33:48.2872736Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:48.2873332Z 2025-05-07T20:33:48.4570482Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:48.4581773Z self=, 2025-05-07T20:33:48.4582207Z T=16384, 2025-05-07T20:33:48.4582522Z D=5120, 2025-05-07T20:33:48.4582723Z scale_ub=None, 2025-05-07T20:33:48.4582945Z contiguous=False, 2025-05-07T20:33:48.4583178Z compiled=True, 2025-05-07T20:33:48.4583386Z ) 2025-05-07T20:33:48.4583715Z self = 2025-05-07T20:33:48.4584282Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:48.4584572Z 2025-05-07T20:33:48.4584646Z @given( 2025-05-07T20:33:48.4584867Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:48.4585175Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:48.4585484Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:48.4585817Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:48.4586141Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:48.4586423Z ) 2025-05-07T20:33:48.4586770Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:48.4587225Z def test_silu_mul_quant( 2025-05-07T20:33:48.4587458Z self, 2025-05-07T20:33:48.4587648Z T: int, 2025-05-07T20:33:48.4587838Z D: int, 2025-05-07T20:33:48.4588110Z scale_ub: Optional[float], 2025-05-07T20:33:48.4588391Z contiguous: bool, 2025-05-07T20:33:48.4588630Z compiled: bool, 2025-05-07T20:33:48.4588850Z ) -> None: 2025-05-07T20:33:48.4589069Z torch.manual_seed(2025) 2025-05-07T20:33:48.4589315Z 2025-05-07T20:33:48.4589583Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:48.4589927Z 2025-05-07T20:33:48.4590117Z x_sign = torch.sign(x) 2025-05-07T20:33:48.4590407Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:48.4590719Z x = x_sign * x_clamp 2025-05-07T20:33:48.4590957Z x0 = x[:, :D] 2025-05-07T20:33:48.4591169Z x1 = x[:, D:] 2025-05-07T20:33:48.4591368Z 2025-05-07T20:33:48.4591549Z if contiguous: 2025-05-07T20:33:48.4591777Z x0 = x0.contiguous() 2025-05-07T20:33:48.4592030Z x1 = x1.contiguous() 2025-05-07T20:33:48.4592262Z 2025-05-07T20:33:48.4592450Z if scale_ub is not None: 2025-05-07T20:33:48.4592715Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:48.4593058Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:48.4593372Z ) 2025-05-07T20:33:48.4593554Z else: 2025-05-07T20:33:48.4593762Z scale_ub_tensor = None 2025-05-07T20:33:48.4594020Z 2025-05-07T20:33:48.4594247Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:48.4594567Z op = silu_mul_quant 2025-05-07T20:33:48.4594816Z if compiled: 2025-05-07T20:33:48.4595056Z op = torch.compile(op) 2025-05-07T20:33:48.4595359Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:48.4595646Z 2025-05-07T20:33:48.4595841Z > y_fp8, y_scale = fn() 2025-05-07T20:33:48.4596008Z 2025-05-07T20:33:48.4596110Z moe/activation_test.py:117: 2025-05-07T20:33:48.4596407Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:48.4596753Z moe/activation_test.py:115: in fn 2025-05-07T20:33:48.4597038Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:48.4597616Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:48.4598196Z return fn(*args, **kwargs) 
2025-05-07T20:33:48.4598866Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:48.4599711Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:48.4600268Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:48.4600977Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:48.4601702Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:48.4602261Z kernel = self.compile( 2025-05-07T20:33:48.4602826Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:48.4603549Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:48.4603951Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:48.4604189Z 2025-05-07T20:33:48.4604402Z self = 2025-05-07T20:33:48.4605530Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:48.4606965Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1d1a0c20>} 2025-05-07T20:33:48.4608410Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:48.4609549Z context = 2025-05-07T20:33:48.4609856Z 2025-05-07T20:33:48.4610022Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:48.4610556Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:48.4611033Z module_map=module_map) 2025-05-07T20:33:48.4611411Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:48.4611777Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:48.4612046Z E ^ 2025-05-07T20:33:48.4612524Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:48.4613438Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
Hypothesis went on to try ten more examples; each failed with this same fp8e4nv CompilationError from _fbgemm_silu_mul_quant:
2025-05-07T20:33:48.4614090Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:48.5544754Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:48.7315496Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:33:48.7349318Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:49.0337352Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:33:49.1591347Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:33:49.1626846Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:33:49.3383735Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:49.3423321Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:49.4367220Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
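Every CompilationError above bottoms out in the same place: Triton refuses to lower _fbgemm_silu_mul_quant because the destination dtype fp8e4nv (float8_e4m3fn) is not implemented for this GPU's architecture, which only exposes 'fp8e4b15' and 'fp8e5'. This is why the failure is identical with compiled=True and compiled=False: both paths launch the same Triton kernel. A minimal sketch of a compute-capability guard that would skip these cases on unsupported hardware follows; the helper name, the test class, and the >= (8, 9) (Ada/Hopper) threshold are assumptions for illustration, not code from activation_test.py:

    import unittest
    import torch

    def gpu_supports_fp8e4nv() -> bool:
        # Assumption: Triton's fp8e4nv (float8_e4m3fn) requires NVIDIA compute
        # capability >= 8.9; older GPUs only expose the 'fp8e4b15' and
        # 'fp8e5' dtypes named in the error above.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    class Fp8GuardExample(unittest.TestCase):  # hypothetical test class
        @unittest.skipIf(
            not gpu_supports_fp8e4nv(),
            "fp8e4nv not supported on this GPU architecture",
        )
        def test_fp8_path(self) -> None:
            # The real test body would call silu_mul_quant here.
            pass

    if __name__ == "__main__":
        unittest.main()

With such a guard (or an equivalent pytest.mark.skipif), the examples above would report as skipped instead of producing one traceback per Hypothesis example for a known-unsupported dtype.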
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:49.4398934Z 2025-05-07T20:33:49.4399383Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
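The CompilationError above is an architecture mismatch, not a bug in the test inputs: Triton's fp8e4nv element type (FP8 E4M3) is only emitted for NVIDIA GPUs with compute capability 8.9 or newer (Ada/Hopper), while the A10G on this g5 runner reports 8.6, where only fp8e4b15 and fp8e5 are available. A minimal sketch of a capability guard, assuming a unittest-style test class (the helper name, class name, and message are illustrative, not from the test file):

    import unittest

    import torch

    def fp8e4nv_supported() -> bool:
        # fp8e4nv (FP8 E4M3) requires compute capability >= 8.9; an A10G reports (8, 6).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(fp8e4nv_supported(), "fp8e4nv needs compute capability 8.9+")
    class SiluMulQuantCudaTest(unittest.TestCase):
        pass

With such a guard the fp8 cases would be reported as skipped on this runner instead of failing at Triton compile time.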
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> OutOfMemoryError at moe/activation_test.py:95 (x_clamp): tried to allocate 112.00 MiB with 28.44 MiB free.
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False) -> OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn(...)): tried to allocate 448.00 MiB with 140.44 MiB free.
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> OutOfMemoryError at moe/activation_test.py:95 (x_clamp): tried to allocate 56.00 MiB with 28.44 MiB free.
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False) -> OutOfMemoryError at moe/activation_test.py:94 (x_sign = torch.sign(x)): tried to allocate 56.00 MiB with 28.44 MiB free.
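The OutOfMemoryError entries are a knock-on effect of the earlier failures: the card (22.07 GiB total) stays nearly full across examples, so later ones fail on allocations as small as 40 MiB. Two mitigations, sketched under the assumption that they are applied at process start and between Hypothesis examples respectively (the helper name is illustrative); the first is the one the error text itself suggests:

    import gc
    import os

    # Must be set before the process makes its first CUDA allocation,
    # hence before importing code that initializes CUDA.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch

    def release_cuda_memory() -> None:
        # Drop dangling Python references, then return cached blocks to the driver.
        gc.collect()
        torch.cuda.empty_cache()

Calling release_cuda_memory() at the end of each example keeps one example's tensors from starving the next; expandable_segments additionally reduces fragmentation of what remains allocated.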
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> triton.compiler.errors.CompilationError at moe/activation_test.py:117 (y_fp8, y_scale = fn()): ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False) -> same CompilationError at moe/activation_test.py:117.
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False) -> same CompilationError at moe/activation_test.py:117.
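For reading the failures it helps to know what the op under test computes: silu_mul_quant fuses SiLU(x0) * x1 with quantization to FP8. A plain-PyTorch sketch under the assumption of rowwise e4m3 scaling with an optional scale upper bound (the function name, scaling rule, and constant are illustrative, not FBGEMM's kernel):

    from typing import Optional, Tuple

    import torch

    FP8_E4M3_MAX = 448.0  # largest finite value of torch.float8_e4m3fn

    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # SiLU on the first half, elementwise product with the second half.
        y = torch.nn.functional.silu(x0.float()) * x1.float()
        # One scale per row, optionally clamped from above.
        row_max = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub.float())
        y_scale = row_max / FP8_E4M3_MAX
        y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)
        return y_fp8, y_scale.squeeze(-1)

The fused Triton kernel cannot even compile on this GPU because that final e4m3 (fp8e4nv) conversion has no hardware mapping below compute capability 8.9.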
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn(...)): tried to allocate 56.00 MiB with 26.44 MiB free.
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> same CompilationError at moe/activation_test.py:117.
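Each of these parameter sets comes from Hypothesis's verbose "Trying example" output, which makes them easy to pin for a deterministic repro once triaged. A self-contained sketch of the pattern (the strategies mirror the test's, the body is a stand-in):

    from hypothesis import example, given, settings, strategies as st

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
    )
    @example(T=2048, D=7168)  # replayed first on every run, before random draws
    @settings(deadline=None)
    def test_pinned_case(T: int, D: int) -> None:
        assert T * D > 0

With @example, the known-bad combination is exercised on every invocation even after the Hypothesis example database is cleared.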
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:49.7833819Z 2025-05-07T20:33:49.7834263Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:49.7834814Z 2025-05-07T20:33:49.7834924Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:49.7835355Z self=, 2025-05-07T20:33:49.7835778Z T=2048, 2025-05-07T20:33:49.7835963Z D=5120, 2025-05-07T20:33:49.7836154Z scale_ub=None, 2025-05-07T20:33:49.7836375Z contiguous=True, 2025-05-07T20:33:49.7836596Z compiled=False, 2025-05-07T20:33:49.7836804Z ) 2025-05-07T20:33:49.7837129Z self = 2025-05-07T20:33:49.7837642Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:49.7837928Z 2025-05-07T20:33:49.7838008Z @given( 2025-05-07T20:33:49.7838307Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:49.7838630Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:49.7838937Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:49.7839273Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:49.7839660Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:49.7839946Z ) 2025-05-07T20:33:49.7840305Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:49.7840768Z def test_silu_mul_quant( 2025-05-07T20:33:49.7841008Z self, 2025-05-07T20:33:49.7841199Z T: int, 2025-05-07T20:33:49.7841399Z D: int, 2025-05-07T20:33:49.7841614Z scale_ub: Optional[float], 2025-05-07T20:33:49.7841891Z contiguous: bool, 2025-05-07T20:33:49.7842136Z compiled: bool, 2025-05-07T20:33:49.7842351Z ) -> None: 2025-05-07T20:33:49.7842570Z torch.manual_seed(2025) 2025-05-07T20:33:49.7842811Z 2025-05-07T20:33:49.7843084Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:49.7843441Z 2025-05-07T20:33:49.7843638Z > x_sign = torch.sign(x) 2025-05-07T20:33:49.7845721Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:49.7847781Z 2025-05-07T20:33:49.7847913Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:49.7848134Z 2025-05-07T20:33:49.7848241Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:49.7848674Z self=, 2025-05-07T20:33:49.7849096Z T=16384, 2025-05-07T20:33:49.7849283Z D=5120, 2025-05-07T20:33:49.7849483Z scale_ub=None, 2025-05-07T20:33:49.7849707Z contiguous=True, 2025-05-07T20:33:49.7849945Z compiled=False, 2025-05-07T20:33:49.7850153Z ) 2025-05-07T20:33:49.8614927Z self = 2025-05-07T20:33:49.8615505Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:49.8615796Z 2025-05-07T20:33:49.8615887Z @given( 2025-05-07T20:33:49.8616113Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:49.8616438Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:49.8623975Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:49.8624457Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:49.8624803Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:49.8625103Z ) 2025-05-07T20:33:49.8625637Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:49.8626168Z def test_silu_mul_quant( 2025-05-07T20:33:49.8626416Z self, 2025-05-07T20:33:49.8626614Z T: int, 2025-05-07T20:33:49.8626813Z D: int, 2025-05-07T20:33:49.8627037Z scale_ub: Optional[float], 2025-05-07T20:33:49.8627325Z contiguous: bool, 2025-05-07T20:33:49.8627565Z compiled: bool, 2025-05-07T20:33:49.8627794Z ) -> None: 2025-05-07T20:33:49.8628018Z torch.manual_seed(2025) 2025-05-07T20:33:49.8628258Z 2025-05-07T20:33:49.8628538Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:49.8630863Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:49.8632882Z 2025-05-07T20:33:49.8633006Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:49.8633226Z 2025-05-07T20:33:49.8633346Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:49.8633776Z self=, 2025-05-07T20:33:49.8634209Z T=4096, 2025-05-07T20:33:49.8634411Z D=5120, 2025-05-07T20:33:49.8634608Z scale_ub=None, 2025-05-07T20:33:49.8634837Z contiguous=True, 2025-05-07T20:33:49.8635068Z compiled=False, 2025-05-07T20:33:49.8635281Z ) 2025-05-07T20:33:49.8635613Z self = 2025-05-07T20:33:49.8636135Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:49.8636422Z 2025-05-07T20:33:49.8636519Z @given( 2025-05-07T20:33:49.8636752Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:49.8637088Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:49.8637406Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:49.8637747Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:49.8638094Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:49.8638390Z ) 2025-05-07T20:33:49.8638746Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:49.8639210Z def test_silu_mul_quant( 2025-05-07T20:33:49.8639529Z self, 2025-05-07T20:33:49.8639729Z T: int, 2025-05-07T20:33:49.8639931Z D: int, 2025-05-07T20:33:49.8640155Z scale_ub: Optional[float], 2025-05-07T20:33:49.8640433Z contiguous: bool, 2025-05-07T20:33:49.8640686Z compiled: bool, 2025-05-07T20:33:49.8640919Z ) -> None: 2025-05-07T20:33:49.8641142Z torch.manual_seed(2025) 2025-05-07T20:33:49.8641385Z 2025-05-07T20:33:49.8641661Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:49.8643906Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:49.8645900Z 2025-05-07T20:33:49.8646029Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:49.8646247Z 2025-05-07T20:33:49.8646357Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:49.8646784Z self=, 2025-05-07T20:33:49.8647250Z T=2048, 2025-05-07T20:33:49.8647442Z D=5120, 2025-05-07T20:33:49.8647622Z scale_ub=None, 2025-05-07T20:33:49.8647839Z contiguous=False, 2025-05-07T20:33:49.8648069Z compiled=False, 2025-05-07T20:33:49.8648265Z ) 2025-05-07T20:33:49.8648587Z self = 2025-05-07T20:33:49.8649087Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:49.8649373Z 2025-05-07T20:33:49.8649452Z @given( 2025-05-07T20:33:49.8649691Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:49.8650015Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:49.8650329Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:49.8650671Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:49.8651057Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:49.8651359Z ) 2025-05-07T20:33:49.8651719Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:49.8652184Z def test_silu_mul_quant( 2025-05-07T20:33:49.8652437Z self, 2025-05-07T20:33:49.8652637Z T: int, 2025-05-07T20:33:49.8652846Z D: int, 2025-05-07T20:33:49.8653067Z scale_ub: Optional[float], 2025-05-07T20:33:49.8653342Z contiguous: bool, 2025-05-07T20:33:49.8653589Z compiled: bool, 2025-05-07T20:33:49.8653817Z ) -> None: 2025-05-07T20:33:49.8654028Z torch.manual_seed(2025) 2025-05-07T20:33:49.8654278Z 2025-05-07T20:33:49.8654672Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:49.8656862Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:49.8658858Z 2025-05-07T20:33:49.8658979Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:49.8659195Z 2025-05-07T20:33:49.8659344Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:49.8659787Z self=, 2025-05-07T20:33:49.8660201Z T=4096, 2025-05-07T20:33:49.8660440Z D=7168, 2025-05-07T20:33:49.8660626Z scale_ub=None, 2025-05-07T20:33:49.8660842Z contiguous=True, 2025-05-07T20:33:49.8661068Z compiled=True, 2025-05-07T20:33:49.8661264Z ) 2025-05-07T20:33:49.8661586Z self = 2025-05-07T20:33:49.8662096Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:49.8662376Z 2025-05-07T20:33:49.8662454Z @given( 2025-05-07T20:33:49.8662689Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:49.8663012Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:49.8663331Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:49.8663665Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:49.8664004Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:49.8664303Z ) 2025-05-07T20:33:49.8664658Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:49.8665126Z def test_silu_mul_quant( 2025-05-07T20:33:49.8665384Z self, 2025-05-07T20:33:49.8665629Z T: int, 2025-05-07T20:33:49.8665838Z D: int, 2025-05-07T20:33:49.8666064Z scale_ub: Optional[float], 2025-05-07T20:33:49.8666347Z contiguous: bool, 2025-05-07T20:33:49.8666597Z compiled: bool, 2025-05-07T20:33:49.8666870Z ) -> None: 2025-05-07T20:33:49.8667083Z torch.manual_seed(2025) 2025-05-07T20:33:49.8667326Z 2025-05-07T20:33:49.8667603Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:49.8669794Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:49.8671797Z 2025-05-07T20:33:49.8671919Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:49.8672178Z 2025-05-07T20:33:49.8672280Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:49.8672703Z self=, 2025-05-07T20:33:49.8673120Z T=2048, 2025-05-07T20:33:49.8673295Z D=5120, 2025-05-07T20:33:49.8673485Z scale_ub=1200.0, 2025-05-07T20:33:49.8673697Z contiguous=False, 2025-05-07T20:33:49.8673922Z compiled=False, 2025-05-07T20:33:49.8674119Z ) 2025-05-07T20:33:49.8674431Z self = 2025-05-07T20:33:49.8674934Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:49.8675218Z 2025-05-07T20:33:49.8675297Z @given( 2025-05-07T20:33:49.8675518Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:49.8675830Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:49.8676131Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:49.8676461Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:49.8676793Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:49.8677080Z ) 2025-05-07T20:33:49.8677428Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:49.8677882Z def test_silu_mul_quant( 2025-05-07T20:33:49.8678120Z self, 2025-05-07T20:33:49.8678303Z T: int, 2025-05-07T20:33:49.8678496Z D: int, 2025-05-07T20:33:49.8678711Z scale_ub: Optional[float], 2025-05-07T20:33:49.8678978Z contiguous: bool, 2025-05-07T20:33:49.8679215Z compiled: bool, 2025-05-07T20:33:49.8679431Z ) -> None: 2025-05-07T20:33:49.8679633Z torch.manual_seed(2025) 2025-05-07T20:33:49.8679926Z 2025-05-07T20:33:49.8680249Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:49.8682428Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:49.8684414Z 2025-05-07T20:33:49.8684534Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:49.8684748Z 2025-05-07T20:33:49.8684850Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:49.8685264Z self=, 2025-05-07T20:33:49.8685679Z T=4096, 2025-05-07T20:33:49.8685858Z D=7168, 2025-05-07T20:33:49.8686087Z scale_ub=1200.0, 2025-05-07T20:33:49.8686302Z contiguous=True, 2025-05-07T20:33:49.8686517Z compiled=False, 2025-05-07T20:33:49.8686721Z ) 2025-05-07T20:33:49.9756076Z self = 2025-05-07T20:33:49.9756783Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:49.9757078Z 2025-05-07T20:33:49.9757164Z @given( 2025-05-07T20:33:49.9757398Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:49.9757716Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:49.9758029Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:49.9758365Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:49.9758697Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:49.9758989Z ) 2025-05-07T20:33:49.9759341Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:49.9759791Z def test_silu_mul_quant( 2025-05-07T20:33:49.9760039Z self, 2025-05-07T20:33:49.9760238Z T: int, 2025-05-07T20:33:49.9760433Z D: int, 2025-05-07T20:33:49.9760743Z scale_ub: Optional[float], 2025-05-07T20:33:49.9761014Z contiguous: bool, 2025-05-07T20:33:49.9761242Z compiled: bool, 2025-05-07T20:33:49.9761466Z ) -> None: 2025-05-07T20:33:49.9761683Z torch.manual_seed(2025) 2025-05-07T20:33:49.9761928Z 2025-05-07T20:33:49.9762197Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:49.9764389Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:49.9766401Z 2025-05-07T20:33:49.9766520Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:49.9766737Z 2025-05-07T20:33:49.9766845Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:49.9767261Z self=, 2025-05-07T20:33:49.9767677Z T=16384, 2025-05-07T20:33:49.9767872Z D=7168, 2025-05-07T20:33:49.9768056Z scale_ub=None, 2025-05-07T20:33:49.9768285Z contiguous=False, 2025-05-07T20:33:49.9768513Z compiled=True, 2025-05-07T20:33:49.9768720Z ) 2025-05-07T20:33:49.9769044Z self = 2025-05-07T20:33:49.9769604Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:49.9769960Z 2025-05-07T20:33:49.9770038Z @given( 2025-05-07T20:33:49.9770265Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:49.9770574Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:49.9770882Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:49.9771211Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:49.9771547Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:49.9771831Z ) 2025-05-07T20:33:49.9772184Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:49.9772640Z def test_silu_mul_quant( 2025-05-07T20:33:49.9772873Z self, 2025-05-07T20:33:49.9773062Z T: int, 2025-05-07T20:33:49.9773256Z D: int, 2025-05-07T20:33:49.9773466Z scale_ub: Optional[float], 2025-05-07T20:33:49.9773740Z contiguous: bool, 2025-05-07T20:33:49.9773975Z compiled: bool, 2025-05-07T20:33:49.9774187Z ) -> None: 2025-05-07T20:33:49.9774400Z torch.manual_seed(2025) 2025-05-07T20:33:49.9774814Z 2025-05-07T20:33:49.9775083Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:49.9777265Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:49.9779312Z 2025-05-07T20:33:49.9779432Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:49.9779655Z 2025-05-07T20:33:49.9779758Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:49.9780183Z self=, 2025-05-07T20:33:49.9780600Z T=4096, 2025-05-07T20:33:49.9780792Z D=7168, 2025-05-07T20:33:49.9780980Z scale_ub=None, 2025-05-07T20:33:49.9781187Z contiguous=True, 2025-05-07T20:33:49.9781454Z compiled=False, 2025-05-07T20:33:49.9781656Z ) 2025-05-07T20:33:49.9781972Z self = 2025-05-07T20:33:49.9782483Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:49.9782767Z 2025-05-07T20:33:49.9782847Z @given( 2025-05-07T20:33:49.9783076Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:49.9783391Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:49.9783703Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:49.9784042Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:49.9784378Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:49.9784680Z ) 2025-05-07T20:33:49.9785037Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:49.9785487Z def test_silu_mul_quant( 2025-05-07T20:33:49.9785738Z self, 2025-05-07T20:33:49.9785939Z T: int, 2025-05-07T20:33:49.9786145Z D: int, 2025-05-07T20:33:49.9786370Z scale_ub: Optional[float], 2025-05-07T20:33:49.9786649Z contiguous: bool, 2025-05-07T20:33:49.9786892Z compiled: bool, 2025-05-07T20:33:49.9787108Z ) -> None: 2025-05-07T20:33:49.9787319Z torch.manual_seed(2025) 2025-05-07T20:33:49.9787562Z 2025-05-07T20:33:49.9787834Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:49.9790028Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:49.9792087Z 2025-05-07T20:33:49.9792210Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:49.9792431Z 2025-05-07T20:33:49.9792543Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:49.9792968Z self=, 2025-05-07T20:33:49.9793384Z T=16384, 2025-05-07T20:33:49.9793574Z D=7168, 2025-05-07T20:33:49.9793771Z scale_ub=None, 2025-05-07T20:33:49.9793988Z contiguous=True, 2025-05-07T20:33:49.9794222Z compiled=False, 2025-05-07T20:33:49.9794437Z ) 2025-05-07T20:33:49.9794758Z self = 2025-05-07T20:33:49.9795275Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:49.9795610Z 2025-05-07T20:33:49.9795697Z @given( 2025-05-07T20:33:49.9795930Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:49.9796257Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:49.9796575Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:49.9796958Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:49.9797290Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:49.9797580Z ) 2025-05-07T20:33:49.9797933Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:49.9798386Z def test_silu_mul_quant( 2025-05-07T20:33:49.9798630Z self, 2025-05-07T20:33:49.9798821Z T: int, 2025-05-07T20:33:49.9799015Z D: int, 2025-05-07T20:33:49.9799231Z scale_ub: Optional[float], 2025-05-07T20:33:49.9799535Z contiguous: bool, 2025-05-07T20:33:49.9799797Z compiled: bool, 2025-05-07T20:33:49.9800012Z ) -> None: 2025-05-07T20:33:49.9800222Z torch.manual_seed(2025) 2025-05-07T20:33:49.9800454Z 2025-05-07T20:33:49.9800721Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:49.9802944Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:49.9804944Z 2025-05-07T20:33:49.9805063Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:49.9805282Z 2025-05-07T20:33:49.9805388Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:49.9805809Z self=, 2025-05-07T20:33:49.9806228Z T=16384, 2025-05-07T20:33:49.9806417Z D=7168, 2025-05-07T20:33:49.9806599Z scale_ub=1200.0, 2025-05-07T20:33:49.9806818Z contiguous=True, 2025-05-07T20:33:49.9807032Z compiled=False, 2025-05-07T20:33:49.9807226Z ) 2025-05-07T20:33:49.9807540Z self = 2025-05-07T20:33:49.9808051Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:49.9808338Z 2025-05-07T20:33:49.9808420Z @given( 2025-05-07T20:33:49.9808638Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:49.9808951Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:49.9809257Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:49.9809583Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:49.9809957Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:49.9810244Z ) 2025-05-07T20:33:49.9810582Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:49.9811039Z def test_silu_mul_quant( 2025-05-07T20:33:49.9811272Z self, 2025-05-07T20:33:49.9811461Z T: int, 2025-05-07T20:33:49.9811645Z D: int, 2025-05-07T20:33:49.9811863Z scale_ub: Optional[float], 2025-05-07T20:33:49.9812130Z contiguous: bool, 2025-05-07T20:33:49.9812360Z compiled: bool, 2025-05-07T20:33:49.9812578Z ) -> None: 2025-05-07T20:33:49.9812786Z torch.manual_seed(2025) 2025-05-07T20:33:49.9813021Z 2025-05-07T20:33:49.9813286Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:49.9815597Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
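These examples fail one after another with the same ~26 MiB free, which suggests memory from earlier Hypothesis examples is still held when the next one starts. A hedged sketch of a cleanup hook between examples (the setUp placement and class excerpt are assumptions; the real test class may handle this differently):

    import gc
    import unittest

    import torch

    class ActivationTests(unittest.TestCase):  # hypothetical excerpt
        def setUp(self) -> None:
            gc.collect()               # drop tensors kept alive only by Python refs
            torch.cuda.empty_cache()   # return cached, unused blocks to the driver

This helps only if the earlier tensors are actually unreachable; references captured by closures (fn, ref_fn) or cached by torch.compile would keep their memory pinned regardless.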
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:49.9817630Z 2025-05-07T20:33:49.9817752Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:49.9817969Z 2025-05-07T20:33:49.9818073Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:49.9818490Z self=, 2025-05-07T20:33:49.9818910Z T=128, 2025-05-07T20:33:49.9819094Z D=5120, 2025-05-07T20:33:49.9819277Z scale_ub=1200.0, 2025-05-07T20:33:49.9819499Z contiguous=False, 2025-05-07T20:33:49.9819718Z compiled=False, 2025-05-07T20:33:49.9819913Z ) 2025-05-07T20:33:50.1122910Z self = 2025-05-07T20:33:50.1124039Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:50.1124619Z 2025-05-07T20:33:50.1124787Z @given( 2025-05-07T20:33:50.1125746Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.1126420Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.1127027Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.1127703Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.1128368Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.1128951Z ) 2025-05-07T20:33:50.1129643Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.1130286Z def test_silu_mul_quant( 2025-05-07T20:33:50.1130541Z self, 2025-05-07T20:33:50.1130733Z T: int, 2025-05-07T20:33:50.1130935Z D: int, 2025-05-07T20:33:50.1131159Z scale_ub: Optional[float], 2025-05-07T20:33:50.1131438Z contiguous: bool, 2025-05-07T20:33:50.1131690Z compiled: bool, 2025-05-07T20:33:50.1131917Z ) -> None: 2025-05-07T20:33:50.1132133Z torch.manual_seed(2025) 2025-05-07T20:33:50.1132377Z 2025-05-07T20:33:50.1132664Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.1133017Z 2025-05-07T20:33:50.1133215Z x_sign = torch.sign(x) 2025-05-07T20:33:50.1133514Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:50.1133826Z x = x_sign * x_clamp 2025-05-07T20:33:50.1134069Z x0 = x[:, :D] 2025-05-07T20:33:50.1134290Z x1 = x[:, D:] 2025-05-07T20:33:50.1134557Z 2025-05-07T20:33:50.1134744Z if contiguous: 2025-05-07T20:33:50.1134983Z x0 = x0.contiguous() 2025-05-07T20:33:50.1135259Z x1 = x1.contiguous() 2025-05-07T20:33:50.1135504Z 2025-05-07T20:33:50.1135706Z if scale_ub is not None: 2025-05-07T20:33:50.1136084Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:50.1136434Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:50.1136755Z ) 2025-05-07T20:33:50.1136962Z else: 2025-05-07T20:33:50.1137182Z scale_ub_tensor = None 2025-05-07T20:33:50.1137446Z 2025-05-07T20:33:50.1137698Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:50.1138025Z op = silu_mul_quant 2025-05-07T20:33:50.1138285Z if compiled: 2025-05-07T20:33:50.1138558Z op = torch.compile(op) 2025-05-07T20:33:50.1138860Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.1139158Z 2025-05-07T20:33:50.1139366Z > y_fp8, y_scale = fn() 2025-05-07T20:33:50.1139539Z 2025-05-07T20:33:50.1139664Z moe/activation_test.py:117: 2025-05-07T20:33:50.1139972Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.1140328Z moe/activation_test.py:115: in fn 2025-05-07T20:33:50.1140631Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.1141438Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:50.1142192Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:50.1142770Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:50.1143577Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:50.1157861Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:50.1158615Z kernel = self.compile( 2025-05-07T20:33:50.1159205Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:50.1159988Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:50.1160428Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.1160669Z 2025-05-07T20:33:50.1160893Z self = 2025-05-07T20:33:50.1162140Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:50.1163626Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1c9487c0>} 2025-05-07T20:33:50.1165047Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:50.1166144Z context = 2025-05-07T20:33:50.1166451Z 2025-05-07T20:33:50.1166637Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:50.1167185Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:50.1167692Z module_map=module_map) 2025-05-07T20:33:50.1168090Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:50.1168462Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:50.1168746Z E ^ 2025-05-07T20:33:50.1169244Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:50.1169737Z 2025-05-07T20:33:50.1170220Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:50.1170768Z 2025-05-07T20:33:50.1170882Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.1171321Z self=, 2025-05-07T20:33:50.1171807Z T=2048, 2025-05-07T20:33:50.1172011Z D=7168, 2025-05-07T20:33:50.1172226Z scale_ub=None, 2025-05-07T20:33:50.1172464Z contiguous=False, 2025-05-07T20:33:50.1172702Z compiled=False, 2025-05-07T20:33:50.1172935Z ) 2025-05-07T20:33:50.1173278Z self = 2025-05-07T20:33:50.1173797Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:50.1174094Z 2025-05-07T20:33:50.1174180Z @given( 2025-05-07T20:33:50.1174430Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.1174863Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.1175188Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.1175544Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.1175944Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.1176285Z ) 2025-05-07T20:33:50.1176858Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.1181434Z def test_silu_mul_quant( 2025-05-07T20:33:50.1181684Z self, 2025-05-07T20:33:50.1181891Z T: int, 2025-05-07T20:33:50.1182103Z D: int, 2025-05-07T20:33:50.1182345Z scale_ub: Optional[float], 2025-05-07T20:33:50.1182690Z contiguous: bool, 2025-05-07T20:33:50.1182938Z compiled: bool, 2025-05-07T20:33:50.1183168Z ) -> None: 2025-05-07T20:33:50.1183388Z torch.manual_seed(2025) 2025-05-07T20:33:50.1183636Z 2025-05-07T20:33:50.1183916Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.1186161Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
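The CompilationError interleaved above is a different root cause from the OOMs: Triton refuses to lower fp8e4nv (the e4m3 format) on this GPU. A g5.4xlarge carries an A10G (sm_86), and the ValueError lists fp8e4b15 and fp8e5 as the only fp8 types available there; fp8e4nv generally requires a newer architecture (sm_89+, to my understanding -- an assumption, not stated in the log). A hedged capability guard for such tests:

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:  # hypothetical helper; 8.9 threshold is an assumption
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv unsupported on this architecture")
    def test_silu_mul_quant(self) -> None:
        ...

With such a guard the run would report a skip here instead of a CompilationError.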
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:50.1188161Z 2025-05-07T20:33:50.1188289Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:50.1188508Z 2025-05-07T20:33:50.1188617Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.1209714Z self=, 2025-05-07T20:33:50.1210152Z T=128, 2025-05-07T20:33:50.1210333Z D=7168, 2025-05-07T20:33:50.1210521Z scale_ub=1200.0, 2025-05-07T20:33:50.1210738Z contiguous=True, 2025-05-07T20:33:50.1210955Z compiled=True, 2025-05-07T20:33:50.1211152Z ) 2025-05-07T20:33:50.1484949Z self = 2025-05-07T20:33:50.1485470Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:50.1485772Z 2025-05-07T20:33:50.1485857Z @given( 2025-05-07T20:33:50.1486097Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.1486516Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.1486902Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.1487241Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.1487572Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.1487861Z ) 2025-05-07T20:33:50.1488205Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.1488653Z def test_silu_mul_quant( 2025-05-07T20:33:50.1488888Z self, 2025-05-07T20:33:50.1489076Z T: int, 2025-05-07T20:33:50.1489261Z D: int, 2025-05-07T20:33:50.1489469Z scale_ub: Optional[float], 2025-05-07T20:33:50.1489785Z contiguous: bool, 2025-05-07T20:33:50.1490019Z compiled: bool, 2025-05-07T20:33:50.1490373Z ) -> None: 2025-05-07T20:33:50.1490588Z torch.manual_seed(2025) 2025-05-07T20:33:50.1490836Z 2025-05-07T20:33:50.1491114Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.1491460Z 2025-05-07T20:33:50.1491657Z x_sign = torch.sign(x) 2025-05-07T20:33:50.1491961Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:50.1492273Z x = x_sign * x_clamp 2025-05-07T20:33:50.1492521Z x0 = x[:, :D] 2025-05-07T20:33:50.1492744Z x1 = x[:, D:] 2025-05-07T20:33:50.1492954Z 2025-05-07T20:33:50.1493138Z if contiguous: 2025-05-07T20:33:50.1493378Z x0 = x0.contiguous() 2025-05-07T20:33:50.1493646Z x1 = x1.contiguous() 2025-05-07T20:33:50.1493895Z 2025-05-07T20:33:50.1494096Z if scale_ub is not None: 2025-05-07T20:33:50.1494383Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:50.1494824Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:50.1495148Z ) 2025-05-07T20:33:50.1495424Z else: 2025-05-07T20:33:50.1495635Z scale_ub_tensor = None 2025-05-07T20:33:50.1495896Z 2025-05-07T20:33:50.1496135Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:50.1496456Z op = silu_mul_quant 2025-05-07T20:33:50.1496788Z if compiled: 2025-05-07T20:33:50.1497049Z op = torch.compile(op) 2025-05-07T20:33:50.1497354Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.1497649Z 2025-05-07T20:33:50.1497851Z > y_fp8, y_scale = fn() 2025-05-07T20:33:50.1498021Z 2025-05-07T20:33:50.1498131Z moe/activation_test.py:117: 2025-05-07T20:33:50.1498433Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.1498785Z moe/activation_test.py:115: in fn 2025-05-07T20:33:50.1499088Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.1499673Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:50.1500277Z return fn(*args, **kwargs) 
2025-05-07T20:33:50.1501056Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:50.1501796Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:50.1502364Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:50.1503091Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:50.1503802Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:50.1504366Z kernel = self.compile( 2025-05-07T20:33:50.1504941Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:50.1505650Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:50.1506085Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.1506324Z 2025-05-07T20:33:50.1506543Z self = 2025-05-07T20:33:50.1507678Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:50.1509120Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1c949940>} 2025-05-07T20:33:50.1510541Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:50.1511682Z context = 2025-05-07T20:33:50.1511998Z 2025-05-07T20:33:50.1512172Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:50.1512723Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:50.1513219Z module_map=module_map) 2025-05-07T20:33:50.1513593Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:50.1513963Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:50.1514239Z E ^ 2025-05-07T20:33:50.1514726Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:50.1515213Z 2025-05-07T20:33:50.1515653Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:50.1516207Z 2025-05-07T20:33:50.1516318Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.1516805Z self=, 2025-05-07T20:33:50.1517221Z T=128, 2025-05-07T20:33:50.1517427Z D=7168, 2025-05-07T20:33:50.1517623Z scale_ub=1200.0, 2025-05-07T20:33:50.1517843Z contiguous=True, 2025-05-07T20:33:50.1518078Z compiled=False, 2025-05-07T20:33:50.1518330Z ) 2025-05-07T20:33:50.1518659Z self = 2025-05-07T20:33:50.1519174Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:50.1519466Z 2025-05-07T20:33:50.1519551Z @given( 2025-05-07T20:33:50.1519786Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.1520101Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.1520424Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.1520763Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.1521098Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.1521395Z ) 2025-05-07T20:33:50.1521751Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.1522211Z def test_silu_mul_quant( 2025-05-07T20:33:50.1522446Z self, 2025-05-07T20:33:50.1522697Z T: int, 2025-05-07T20:33:50.1522902Z D: int, 2025-05-07T20:33:50.1523125Z scale_ub: Optional[float], 2025-05-07T20:33:50.1523401Z contiguous: bool, 2025-05-07T20:33:50.1523645Z compiled: bool, 2025-05-07T20:33:50.1523863Z ) -> None: 2025-05-07T20:33:50.1524089Z torch.manual_seed(2025) 2025-05-07T20:33:50.1524339Z 2025-05-07T20:33:50.1524613Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.1524975Z 2025-05-07T20:33:50.1525172Z x_sign = torch.sign(x) 2025-05-07T20:33:50.1525704Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:50.1527880Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
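Note where this example died: not at the initial randn on test line 92 but at the clamp on line 95, with only 4.44 MiB free. The straightforward formulation materializes the abs, clamp, and product results as separate full-size tensors before x is rebound. A sketch of an equivalent computation that reuses x's storage (an illustration, not a change present in the repo):

    import torch

    x = torch.randn([128, 2 * 7168], device="cuda", dtype=torch.bfloat16)
    x_sign = torch.sign(x)                        # the one remaining temporary
    x = x.abs_().clamp_(0.01, 2.0).mul_(x_sign)   # same as sign(x) * clamp(abs(x), 0.01, 2.0)

That avoids the abs/clamp/product temporaries per example, though it would not rescue a device that is already 22.06 GiB full.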
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:50.1529879Z 2025-05-07T20:33:50.1529997Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:50.1530222Z 2025-05-07T20:33:50.1530325Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.1530750Z self=, 2025-05-07T20:33:50.1531160Z T=128, 2025-05-07T20:33:50.1531354Z D=5120, 2025-05-07T20:33:50.1531552Z scale_ub=1200.0, 2025-05-07T20:33:50.1531774Z contiguous=True, 2025-05-07T20:33:50.1532088Z compiled=True, 2025-05-07T20:33:50.1532290Z ) 2025-05-07T20:33:50.1532613Z self = 2025-05-07T20:33:50.1533124Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:50.1533415Z 2025-05-07T20:33:50.1533494Z @given( 2025-05-07T20:33:50.1533736Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.1534047Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.1534360Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.1534748Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.1535078Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.1535372Z ) 2025-05-07T20:33:50.1535824Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.1536298Z def test_silu_mul_quant( 2025-05-07T20:33:50.1536544Z self, 2025-05-07T20:33:50.1536760Z T: int, 2025-05-07T20:33:50.1536952Z D: int, 2025-05-07T20:33:50.1537285Z scale_ub: Optional[float], 2025-05-07T20:33:50.1537566Z contiguous: bool, 2025-05-07T20:33:50.1537803Z compiled: bool, 2025-05-07T20:33:50.1538035Z ) -> None: 2025-05-07T20:33:50.1538254Z torch.manual_seed(2025) 2025-05-07T20:33:50.1538567Z 2025-05-07T20:33:50.1538839Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.1539193Z 2025-05-07T20:33:50.1539395Z x_sign = torch.sign(x) 2025-05-07T20:33:50.1539685Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:50.1541815Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:50.1543798Z 2025-05-07T20:33:50.1543978Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:50.1544205Z 2025-05-07T20:33:50.1544319Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.1544749Z self=, 2025-05-07T20:33:50.1545162Z T=128, 2025-05-07T20:33:50.1545364Z D=7168, 2025-05-07T20:33:50.1545567Z scale_ub=None, 2025-05-07T20:33:50.1545775Z contiguous=True, 2025-05-07T20:33:50.1546004Z compiled=True, 2025-05-07T20:33:50.1546212Z ) 2025-05-07T20:33:50.4242448Z self = 2025-05-07T20:33:50.4242983Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:50.4243273Z 2025-05-07T20:33:50.4243352Z @given( 2025-05-07T20:33:50.4245003Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.4245317Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.4245632Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.4245973Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.4246304Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.4246598Z ) 2025-05-07T20:33:50.4246953Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.4247410Z def test_silu_mul_quant( 2025-05-07T20:33:50.4247647Z self, 2025-05-07T20:33:50.4247841Z T: int, 2025-05-07T20:33:50.4248035Z D: int, 2025-05-07T20:33:50.4248243Z scale_ub: Optional[float], 2025-05-07T20:33:50.4248514Z contiguous: bool, 2025-05-07T20:33:50.4248753Z compiled: bool, 2025-05-07T20:33:50.4248966Z ) -> None: 2025-05-07T20:33:50.4249294Z torch.manual_seed(2025) 2025-05-07T20:33:50.4249535Z 2025-05-07T20:33:50.4249814Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.4252053Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:50.4254051Z 2025-05-07T20:33:50.4254169Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:50.4254391Z 2025-05-07T20:33:50.4269993Z FAILED 2025-05-07T20:33:50.4270150Z 2025-05-07T20:33:50.4270294Z =================================== FAILURES =================================== 2025-05-07T20:33:50.4271117Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:33:50.4271687Z + Exception Group Traceback (most recent call last): 2025-05-07T20:33:50.4272357Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 58, in testPartExecutor 2025-05-07T20:33:50.4273019Z | yield 2025-05-07T20:33:50.4273480Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 634, in run 2025-05-07T20:33:50.4274105Z | self._callTestMethod(testMethod) 2025-05-07T20:33:50.4274771Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/unittest/case.py", line 589, in _callTestMethod 2025-05-07T20:33:50.4275459Z | if method() is not None: 2025-05-07T20:33:50.4275754Z | ^^^^^^^^ 2025-05-07T20:33:50.4276501Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:33:50.4277552Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.4277978Z | ^^^^^^^ 2025-05-07T20:33:50.4278857Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:33:50.4279763Z | raise the_error_hypothesis_found 2025-05-07T20:33:50.4280414Z | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:33:50.4281021Z +-+---------------- 1 ---------------- 2025-05-07T20:33:50.4281421Z | Traceback (most recent call last): 2025-05-07T20:33:50.4282428Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:50.4283532Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.4284055Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:50.4286944Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:50.4289792Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:50.4290414Z | self=, 2025-05-07T20:33:50.4290985Z | T=2048, 2025-05-07T20:33:50.4291299Z | D=5120, # or any other generated value 2025-05-07T20:33:50.4291779Z | scale_ub=None, # or any other generated value 2025-05-07T20:33:50.4292380Z | contiguous=True, # or any other generated value 2025-05-07T20:33:50.4292901Z | compiled=False, # or any other generated value 2025-05-07T20:33:50.4293326Z | ) 2025-05-07T20:33:50.4293582Z | 2025-05-07T20:33:50.4294326Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:33:50.4295319Z +---------------- 2 ---------------- 2025-05-07T20:33:50.4295730Z | Traceback (most recent call last): 2025-05-07T20:33:50.4296763Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:50.4297889Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.4298410Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:50.4301381Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:50.4304282Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:50.4304879Z | self=, 2025-05-07T20:33:50.4305296Z | T=128, 2025-05-07T20:33:50.4305496Z | D=7168, 2025-05-07T20:33:50.4305706Z | scale_ub=None, 2025-05-07T20:33:50.4305946Z | contiguous=True, 2025-05-07T20:33:50.4306182Z | compiled=True, 2025-05-07T20:33:50.4306410Z | ) 2025-05-07T20:33:50.4306591Z | 2025-05-07T20:33:50.4307127Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:33:50.4307814Z +---------------- 3 ---------------- 2025-05-07T20:33:50.4308121Z | Traceback (most recent call last): 2025-05-07T20:33:50.4308860Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:50.4309678Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.4310069Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:50.4312290Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
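As the Hypothesis output says, each falsifying example can be replayed exactly. A sketch of the suggested decorator applied to this test, using the blob printed for failure 1 (the decorator stacking shown is an assumption; Hypothesis only requires it to sit on the test case, and it should be removed again after debugging):

    from hypothesis import Verbosity, given, reproduce_failure, settings
    from hypothesis import strategies as st

    @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=')   # blob from failure 1 above
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, deadline=None)
    def test_silu_mul_quant(self, T, D, scale_ub, contiguous, compiled):
        ...  # unchanged test body

This replays only T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False instead of re-sampling the whole grid.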
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:50.4314404Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:50.4314853Z | self=, 2025-05-07T20:33:50.4315277Z | T=128, 2025-05-07T20:33:50.4315478Z | D=5120, 2025-05-07T20:33:50.4315683Z | scale_ub=1200.0, 2025-05-07T20:33:50.4315927Z | contiguous=True, 2025-05-07T20:33:50.4316170Z | compiled=True, 2025-05-07T20:33:50.4316395Z | ) 2025-05-07T20:33:50.4316584Z | 2025-05-07T20:33:50.4317128Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:33:50.4317821Z +---------------- 4 ---------------- 2025-05-07T20:33:50.4318117Z | Traceback (most recent call last): 2025-05-07T20:33:50.4318869Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:33:50.4319623Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:50.4319906Z | ^^^^^^^^ 2025-05-07T20:33:50.4320728Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:33:50.4321767Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:50.4322256Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:50.4323425Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:33:50.4324751Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:50.4325938Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:33:50.4327011Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:50.4327827Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:50.4328786Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:33:50.4352232Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:50.4353052Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:50.4353995Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:33:50.4355027Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:50.4355573Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:50.4356651Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:33:50.4358248Z | fn() 2025-05-07T20:33:50.4359075Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:33:50.4360338Z | self.fn.run( 2025-05-07T20:33:50.4361096Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:33:50.4361947Z | kernel = self.compile( 2025-05-07T20:33:50.4362322Z | ^^^^^^^^^^^^^ 2025-05-07T20:33:50.4363171Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:33:50.4364208Z | 
module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:50.4364768Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:50.4365708Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:33:50.4366858Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:50.4367545Z | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:50.4368082Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:50.4368579Z | def _kernel_quantize_fp8_row( 2025-05-07T20:33:50.4368967Z | ^ 2025-05-07T20:33:50.4369638Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:50.4370620Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:50.4371177Z | # The test always failed when commented parts were varied together. 2025-05-07T20:33:50.4371902Z | self=, 2025-05-07T20:33:50.4372517Z | T=1, # or any other generated value 2025-05-07T20:33:50.4372949Z | D=5120, # or any other generated value 2025-05-07T20:33:50.4373442Z | scale_ub=None, # or any other generated value 2025-05-07T20:33:50.4373989Z | contiguous=True, # or any other generated value 2025-05-07T20:33:50.4374671Z | compiled=True, # or any other generated value 2025-05-07T20:33:50.4375107Z | ) 2025-05-07T20:33:50.4375362Z | 2025-05-07T20:33:50.4376119Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:33:50.4377007Z +------------------------------------ 2025-05-07T20:33:50.4377618Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:33:50.4378163Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.4378748Z self=, 2025-05-07T20:33:50.4379400Z T=1, 2025-05-07T20:33:50.4379658Z D=5120, 2025-05-07T20:33:50.4379915Z scale_ub=None, 2025-05-07T20:33:50.4380210Z contiguous=True, 2025-05-07T20:33:50.4380513Z compiled=True, 2025-05-07T20:33:50.4380796Z ) 2025-05-07T20:33:50.4381225Z self = 2025-05-07T20:33:50.4381881Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:50.4382235Z 2025-05-07T20:33:50.4382355Z @given( 2025-05-07T20:33:50.4382665Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.4383111Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.4383548Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.4384017Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.4384498Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.4384919Z ) 2025-05-07T20:33:50.4385468Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.4386125Z def test_silu_mul_quant( 2025-05-07T20:33:50.4386467Z self, 2025-05-07T20:33:50.4386738Z T: int, 2025-05-07T20:33:50.4387002Z D: int, 2025-05-07T20:33:50.4387301Z scale_ub: Optional[float], 2025-05-07T20:33:50.4387679Z contiguous: bool, 2025-05-07T20:33:50.4388014Z compiled: bool, 2025-05-07T20:33:50.4388340Z ) -> None: 2025-05-07T20:33:50.4388646Z torch.manual_seed(2025) 2025-05-07T20:33:50.4388978Z 2025-05-07T20:33:50.4389348Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.4389846Z 2025-05-07T20:33:50.4390123Z x_sign = torch.sign(x) 2025-05-07T20:33:50.4390548Z x_clamp = 
torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:50.4390999Z x = x_sign * x_clamp 2025-05-07T20:33:50.4391330Z x0 = x[:, :D] 2025-05-07T20:33:50.4391636Z x1 = x[:, D:] 2025-05-07T20:33:50.4391926Z 2025-05-07T20:33:50.4392189Z if contiguous: 2025-05-07T20:33:50.4392508Z x0 = x0.contiguous() 2025-05-07T20:33:50.4392866Z x1 = x1.contiguous() 2025-05-07T20:33:50.4393198Z 2025-05-07T20:33:50.4393464Z if scale_ub is not None: 2025-05-07T20:33:50.4393854Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:50.4394314Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:50.4394711Z ) 2025-05-07T20:33:50.4394973Z else: 2025-05-07T20:33:50.4395260Z scale_ub_tensor = None 2025-05-07T20:33:50.4395587Z 2025-05-07T20:33:50.4395892Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:50.4396378Z op = silu_mul_quant 2025-05-07T20:33:50.4396708Z if compiled: 2025-05-07T20:33:50.4397037Z op = torch.compile(op) 2025-05-07T20:33:50.4397429Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.4397795Z 2025-05-07T20:33:50.4398057Z y_fp8, y_scale = fn() 2025-05-07T20:33:50.4398436Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:50.4398843Z 2025-05-07T20:33:50.4399184Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:50.4399642Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:50.4400041Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:50.4400459Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:50.4400943Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:50.4401368Z 2025-05-07T20:33:50.4401637Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:50.4401921Z 2025-05-07T20:33:50.4402058Z moe/activation_test.py:126: 2025-05-07T20:33:50.4402508Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.4402952Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:50.4403396Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:50.4404571Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:50.4405696Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:50.4406489Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:50.4407446Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:50.4408440Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:50.4409528Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:50.4410576Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:50.4411525Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:50.4412352Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:50.4413060Z fn() 2025-05-07T20:33:50.4413761Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:50.4414663Z self.fn.run( 2025-05-07T20:33:50.4415331Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:50.4416060Z kernel = self.compile( 2025-05-07T20:33:50.4416828Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:50.4417786Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:50.4418371Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.4418725Z 2025-05-07T20:33:50.4419022Z self = 2025-05-07T20:33:50.4420661Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:50.4422698Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c0b065c60>} 2025-05-07T20:33:50.4424678Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:50.4426581Z context = 2025-05-07T20:33:50.4427011Z 2025-05-07T20:33:50.4427246Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:50.4428003Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:50.4428630Z module_map=module_map) 2025-05-07T20:33:50.4429106Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:50.4429604Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:50.4429975Z E ^ 2025-05-07T20:33:50.4430614Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:50.4431247Z 2025-05-07T20:33:50.4431812Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:50.4432563Z 2025-05-07T20:33:50.4432707Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.4433386Z self=, 2025-05-07T20:33:50.4433952Z T=2048, 2025-05-07T20:33:50.4434218Z D=5120, 2025-05-07T20:33:50.4434496Z scale_ub=1200.0, 2025-05-07T20:33:50.4434808Z contiguous=True, 2025-05-07T20:33:50.4435221Z compiled=False, 2025-05-07T20:33:50.4435518Z ) 2025-05-07T20:33:50.4435964Z self = 2025-05-07T20:33:50.4436678Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:50.4437086Z 2025-05-07T20:33:50.4437199Z @given( 2025-05-07T20:33:50.4437529Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.4437964Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.4438401Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.4438882Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.4439370Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.4439798Z ) 2025-05-07T20:33:50.4440308Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.4441031Z def test_silu_mul_quant( 2025-05-07T20:33:50.4441387Z self, 2025-05-07T20:33:50.4441675Z T: int, 2025-05-07T20:33:50.4441956Z D: int, 2025-05-07T20:33:50.4442287Z scale_ub: Optional[float], 2025-05-07T20:33:50.4442686Z contiguous: bool, 2025-05-07T20:33:50.4443024Z compiled: bool, 2025-05-07T20:33:50.4443351Z ) -> None: 2025-05-07T20:33:50.4443670Z torch.manual_seed(2025) 2025-05-07T20:33:50.4444041Z 2025-05-07T20:33:50.4444422Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.4444929Z 2025-05-07T20:33:50.4445212Z x_sign = torch.sign(x) 2025-05-07T20:33:50.4445621Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:50.4446079Z x = x_sign * x_clamp 2025-05-07T20:33:50.4446430Z x0 = x[:, :D] 
2025-05-07T20:33:50.4446737Z x1 = x[:, D:] 2025-05-07T20:33:50.4447046Z 2025-05-07T20:33:50.4447319Z if contiguous: 2025-05-07T20:33:50.4447655Z x0 = x0.contiguous() 2025-05-07T20:33:50.4448041Z x1 = x1.contiguous() 2025-05-07T20:33:50.4448403Z 2025-05-07T20:33:50.4448678Z if scale_ub is not None: 2025-05-07T20:33:50.4449073Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:50.4449564Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:50.4450008Z ) 2025-05-07T20:33:50.4450296Z else: 2025-05-07T20:33:50.4451048Z scale_ub_tensor = None 2025-05-07T20:33:50.4451408Z 2025-05-07T20:33:50.4451733Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:50.4452176Z op = silu_mul_quant 2025-05-07T20:33:50.4452515Z if compiled: 2025-05-07T20:33:50.4452957Z op = torch.compile(op) 2025-05-07T20:33:50.4453371Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.4453767Z 2025-05-07T20:33:50.4454022Z > y_fp8, y_scale = fn() 2025-05-07T20:33:50.4454241Z 2025-05-07T20:33:50.4454385Z moe/activation_test.py:117: 2025-05-07T20:33:50.4454901Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.4455360Z moe/activation_test.py:115: in fn 2025-05-07T20:33:50.4455752Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.4456728Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:50.4457687Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:50.4458412Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:50.4459346Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:50.4460398Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:50.4461700Z kernel = self.compile( 2025-05-07T20:33:50.4462695Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:50.4463703Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:50.4464274Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.4464610Z 2025-05-07T20:33:50.4464897Z self = 2025-05-07T20:33:50.4466446Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:50.4468459Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c0aebc220>} 2025-05-07T20:33:50.4470580Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:50.4472087Z context = 2025-05-07T20:33:50.4472504Z 2025-05-07T20:33:50.4472747Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:50.4473500Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:50.4474155Z module_map=module_map) 2025-05-07T20:33:50.4474658Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:50.4475161Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:50.4475510Z E ^ 2025-05-07T20:33:50.4476164Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:50.4476822Z 2025-05-07T20:33:50.4477432Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:50.4478149Z 2025-05-07T20:33:50.4478298Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.4478841Z self=, 2025-05-07T20:33:50.4479384Z T=2048, 2025-05-07T20:33:50.4479635Z D=5120, 2025-05-07T20:33:50.4479887Z scale_ub=1200.0, 2025-05-07T20:33:50.4480199Z contiguous=True, 2025-05-07T20:33:50.4480494Z compiled=True, 2025-05-07T20:33:50.4480761Z ) 2025-05-07T20:33:50.4481182Z self = 2025-05-07T20:33:50.4481838Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:50.4482295Z 2025-05-07T20:33:50.4482424Z @given( 2025-05-07T20:33:50.4482756Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.4483218Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.4483674Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.4484130Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.4484577Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.4484961Z ) 2025-05-07T20:33:50.4485419Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.4486023Z def test_silu_mul_quant( 2025-05-07T20:33:50.4486363Z self, 2025-05-07T20:33:50.4486623Z T: int, 2025-05-07T20:33:50.4486884Z D: int, 2025-05-07T20:33:50.4487172Z scale_ub: Optional[float], 2025-05-07T20:33:50.4487526Z contiguous: bool, 2025-05-07T20:33:50.4487848Z compiled: bool, 2025-05-07T20:33:50.4488142Z ) -> None: 2025-05-07T20:33:50.4488414Z torch.manual_seed(2025) 2025-05-07T20:33:50.4488744Z 2025-05-07T20:33:50.4489166Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.4489652Z 2025-05-07T20:33:50.4489939Z x_sign = torch.sign(x) 2025-05-07T20:33:50.4490337Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:50.4490807Z x = x_sign * x_clamp 2025-05-07T20:33:50.4491118Z x0 = x[:, :D] 2025-05-07T20:33:50.4491410Z x1 = x[:, D:] 2025-05-07T20:33:50.4491688Z 2025-05-07T20:33:50.4491932Z if contiguous: 2025-05-07T20:33:50.4492253Z x0 = x0.contiguous() 2025-05-07T20:33:50.4492607Z x1 = x1.contiguous() 2025-05-07T20:33:50.4492916Z 2025-05-07T20:33:50.4493173Z if scale_ub is not None: 2025-05-07T20:33:50.4493543Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:50.4493988Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:50.4494405Z ) 2025-05-07T20:33:50.4494751Z else: 2025-05-07T20:33:50.4495025Z scale_ub_tensor = None 2025-05-07T20:33:50.4495372Z 2025-05-07T20:33:50.4495683Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:50.4496107Z op = silu_mul_quant 2025-05-07T20:33:50.4496489Z if compiled: 2025-05-07T20:33:50.4496822Z op = torch.compile(op) 2025-05-07T20:33:50.4497225Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.4497597Z 2025-05-07T20:33:50.4497856Z y_fp8, y_scale = fn() 2025-05-07T20:33:50.4498235Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:50.4498619Z 2025-05-07T20:33:50.4498936Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:50.4499388Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:50.4499772Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:50.4500259Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:50.4500777Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:50.4501250Z 2025-05-07T20:33:50.4501540Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:50.4501847Z 2025-05-07T20:33:50.4501994Z moe/activation_test.py:126: 2025-05-07T20:33:50.4502443Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.4502934Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:50.4503403Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:50.4504556Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:50.4505664Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:50.4506462Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:50.4507454Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:50.4508503Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:50.4509551Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:50.4510611Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:50.4511542Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:50.4512420Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:50.4513159Z fn() 2025-05-07T20:33:50.4513875Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:50.4514761Z self.fn.run( 2025-05-07T20:33:50.4515446Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:50.4516263Z kernel = self.compile( 2025-05-07T20:33:50.4517102Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:50.4518054Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:50.4518613Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.4519003Z 2025-05-07T20:33:50.4519294Z self = 2025-05-07T20:33:50.4520902Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:50.4522904Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1c0aebd8a0>} 2025-05-07T20:33:50.4524865Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:50.4526752Z context = 2025-05-07T20:33:50.4527161Z 2025-05-07T20:33:50.4527408Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:50.4528202Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:50.4528902Z module_map=module_map) 2025-05-07T20:33:50.4529436Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:50.4529946Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:50.4530328Z E ^ 2025-05-07T20:33:50.4530982Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:50.4531652Z 2025-05-07T20:33:50.4532260Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:50.4532999Z 2025-05-07T20:33:50.4533151Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.4533727Z self=, 2025-05-07T20:33:50.4534289Z T=16384, 2025-05-07T20:33:50.4534638Z D=7168, 2025-05-07T20:33:50.4534910Z scale_ub=1200.0, 2025-05-07T20:33:50.4535221Z contiguous=False, 2025-05-07T20:33:50.4535547Z compiled=False, 2025-05-07T20:33:50.4535845Z ) 2025-05-07T20:33:50.4536291Z self = 2025-05-07T20:33:50.4537004Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:50.4537402Z 2025-05-07T20:33:50.4537515Z @given( 2025-05-07T20:33:50.4537821Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.4538354Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.4538791Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.4539243Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.4539700Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.4540155Z ) 2025-05-07T20:33:50.4540642Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.4541276Z def test_silu_mul_quant( 2025-05-07T20:33:50.4541633Z self, 2025-05-07T20:33:50.4541920Z T: int, 2025-05-07T20:33:50.4542201Z D: int, 2025-05-07T20:33:50.4542520Z scale_ub: Optional[float], 2025-05-07T20:33:50.4542913Z contiguous: bool, 2025-05-07T20:33:50.4543252Z compiled: bool, 2025-05-07T20:33:50.4543564Z ) -> None: 2025-05-07T20:33:50.4543865Z torch.manual_seed(2025) 2025-05-07T20:33:50.4544198Z 2025-05-07T20:33:50.4544571Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.4545061Z 2025-05-07T20:33:50.4545405Z x_sign = torch.sign(x) 2025-05-07T20:33:50.4545805Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:50.4546236Z x = x_sign * x_clamp 2025-05-07T20:33:50.4546576Z x0 = x[:, :D] 2025-05-07T20:33:50.4546876Z x1 = x[:, D:] 2025-05-07T20:33:50.4547250Z 2025-05-07T20:33:50.4547510Z if contiguous: 2025-05-07T20:33:50.4547827Z x0 = x0.contiguous() 2025-05-07T20:33:50.4548189Z x1 = x1.contiguous() 2025-05-07T20:33:50.4548529Z 2025-05-07T20:33:50.4548791Z if scale_ub is not None: 2025-05-07T20:33:50.4549175Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:50.4549640Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:50.4550062Z ) 2025-05-07T20:33:50.4550330Z else: 2025-05-07T20:33:50.4550615Z scale_ub_tensor = None 2025-05-07T20:33:50.4550960Z 2025-05-07T20:33:50.4551282Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:50.4551721Z op = silu_mul_quant 2025-05-07T20:33:50.4552060Z if compiled: 2025-05-07T20:33:50.4552410Z op = torch.compile(op) 2025-05-07T20:33:50.4552881Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.4553260Z 2025-05-07T20:33:50.4553531Z > y_fp8, y_scale = fn() 2025-05-07T20:33:50.4553767Z 2025-05-07T20:33:50.4553905Z moe/activation_test.py:117: 2025-05-07T20:33:50.4554320Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.4554801Z moe/activation_test.py:115: in fn 2025-05-07T20:33:50.4555212Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.4556229Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <triton.compiler.compiler.ASTSource object at 0x...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f1c09d487c0>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
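For orientation, the computation this test verifies is small enough to state in plain PyTorch. The sketch below mirrors the intent of the silu_mul_quant / triton_quantize_fp8_row pair used above (a SiLU-gated multiply followed by row-wise FP8 quantization); it is an approximation for illustration, not FBGEMM's implementation, the clamping details are assumptions, and it needs a PyTorch build with torch.float8_e4m3fn:

from typing import Optional, Tuple

import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3


def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # y = silu(x0) * x1, computed in fp32 as in the test's ref_fn.
    y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
    # Row-wise dequantization scale: per-row max |y| (optionally capped by
    # scale_ub) divided by the largest representable E4M3 value.
    row_max = y.abs().amax(dim=-1)
    if scale_ub is not None:
        row_max = torch.clamp(row_max, max=scale_ub.item())
    scale = torch.clamp(row_max, min=1e-12) / FP8_MAX
    y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
    # The test dequantizes as y_fp8.to(torch.float32) * y_scale[:, None].
    return y_fp8, scale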
Hypothesis then tried nine more examples; every one raised the same triton.compiler.errors.CompilationError at compiler.py:100 ("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"). The per-example tracebacks are identical to the one above, so only the tried parameters and the failing call are kept here:

  T      D     scale_ub  contiguous  compiled  fails at            kernel that failed to compile
  -----  ----  --------  ----------  --------  ------------------  -----------------------------
  1      7168  None      True        True      ref_fn()  line 126  _kernel_quantize_fp8_row
  4096   5120  None      False       False     fn()      line 117  _fbgemm_silu_mul_quant
  4096   7168  None      False       False     fn()      line 117  _fbgemm_silu_mul_quant
  128    7168  None      False       True      ref_fn()  line 126  _kernel_quantize_fp8_row
  128    7168  None      False       False     fn()      line 117  _fbgemm_silu_mul_quant
  4096   5120  1200.0    True        False     fn()      line 117  _fbgemm_silu_mul_quant
  1      5120  None      True        True      ref_fn()  line 126  _kernel_quantize_fp8_row
  2048   5120  None      True        True      ref_fn()  line 126  _kernel_quantize_fp8_row
  128    5120  None      True        True      ref_fn()  line 126  _kernel_quantize_fp8_row

In this log the failure point tracks the compiled flag: with compiled=False the error surfaces in fn() at the _fbgemm_silu_mul_quant launch, while with compiled=True it surfaces later, in ref_fn()'s _kernel_quantize_fp8_row launch.
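To chase one of these cases without the Hypothesis search loop, the failing call can be driven directly. A hypothetical repro using the same names as the test above (the import path is inferred from the traceback; it must run on an fp8e4nv-capable GPU, i.e. SM 8.9+, to get past this CompilationError):

import torch

# Import path inferred from the traceback
# (.../fbgemm_gpu/experimental/gen_ai/moe/activation.py).
from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

T, D = 128, 5120  # one of the parameter sets Hypothesis tried above
x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
x0, x1 = x[:, :D].contiguous(), x[:, D:].contiguous()

y_fp8, y_scale = silu_mul_quant(x0, x1, None)  # scale_ub_tensor=None
print(y_fp8.shape, y_scale.shape)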
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

(the remaining Hypothesis examples fail with the identical CompilationError; full tracebacks elided below, only the drawn parameters and the failing entry point differ)

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
  -> fails at moe/activation_test.py:126 in ref_fn, compiling _kernel_quantize_fp8_row

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
  -> fails at moe/activation_test.py:126 in ref_fn, compiling _kernel_quantize_fp8_row
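Every failure above has the same root cause: the Triton kernels materialize the fp8e4nv (torch.float8_e4m3fn) dtype, which Triton supports only on compute capability 8.9 and newer (Ada/Hopper). This job runs on a linux.g5.4xlarge runner, whose NVIDIA A10G is capability 8.6, so only fp8e4b15 and fp8e5 are available, exactly as the ValueError reports. A minimal skip-guard sketch, assuming pytest and a CUDA build of torch; the requires_fp8 marker name is illustrative, not part of the test suite:

import pytest
import torch

def _supports_fp8e4nv() -> bool:
    # fp8e4nv (torch.float8_e4m3fn) needs SM 8.9+ (e.g. L4, L40S, H100)
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

# Hypothetical marker one could apply to test_silu_mul_quant
requires_fp8 = pytest.mark.skipif(
    not _supports_fp8e4nv(),
    reason="fp8e4nv requires compute capability >= 8.9; this GPU only "
    "supports fp8e4b15/fp8e5 in Triton",
)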
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
  -> fails at moe/activation_test.py:117 in fn (torch.compile path), compiling _fbgemm_silu_mul_quant

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=True)
  -> fails at moe/activation_test.py:126 in ref_fn, compiling _kernel_quantize_fp8_row
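For context, the ref_fn path in the test computes y = SiLU(x0) * x1 and then quantizes y row-wise to fp8. A minimal eager sketch of that math, assuming torch.float8_e4m3fn is available; silu_mul_quant_ref is an illustrative name, not FBGEMM's API, and the real triton_quantize_fp8_row kernel may differ in details such as epsilon handling:

import torch

def silu_mul_quant_ref(x0, x1, scale_ub=None, eps=1e-12):
    # SiLU(x0) * x1 in fp32, as in the test's ref_fn
    y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
    row_max = y.abs().amax(dim=1)                      # per-row absolute max
    if scale_ub is not None:
        row_max = torch.clamp(row_max, max=float(scale_ub))
    fp8_max = torch.finfo(torch.float8_e4m3fn).max     # 448.0 for e4m3fn
    y_scale = row_max / fp8_max                        # per-row dequant scale
    y_fp8 = (y / y_scale.clamp(min=eps)[:, None]).to(torch.float8_e4m3fn)
    # dequantize as the test does: y_fp8.to(torch.float32) * y_scale[:, None]
    return y_fp8, y_scale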
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)
  -> fails at moe/activation_test.py:117 in fn (eager path), compiling _fbgemm_silu_mul_quant

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=True)
  -> fails at moe/activation_test.py:117 in fn (torch.compile path), compiling _fbgemm_silu_mul_quant
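The failure is reproducible without FBGEMM at all: any Triton kernel that touches fp8e4nv trips the same architecture check at compile time. A standalone repro sketch, assuming triton and a CUDA build of torch (the kernel name _cast_to_fp8e4nv is ours, not from either library); on this A10G it should raise the same CompilationError:

import torch
import triton
import triton.language as tl

@triton.jit
def _cast_to_fp8e4nv(x_ptr, y_ptr, N, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < N
    x = tl.load(x_ptr + offs, mask=mask)
    # the cast below is what requires SM 8.9+; on SM 8.6 Triton raises
    # ValueError("type fp8e4nv not supported in this architecture. ...")
    tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

x = torch.randn(128, device="cuda")
y = torch.empty(128, device="cuda", dtype=torch.float8_e4m3fn)
_cast_to_fp8e4nv[(1,)](x, y, x.numel(), BLOCK=128)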
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
  -> fails at moe/activation_test.py:117 in fn (eager path), compiling _fbgemm_silu_mul_quant

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False)
  -> fails at moe/activation_test.py:117 in fn (eager path), compiling _fbgemm_silu_mul_quant
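Short of skipping the tests, a caller can also degrade gracefully by picking an fp8 dtype the device actually supports. A hypothetical helper, not FBGEMM's actual behavior (the library may intentionally require e4m3); torch.float8_e5m2 corresponds to Triton's fp8e5, which the error message lists as supported on this GPU:

import torch

def pick_fp8_dtype() -> torch.dtype:
    major, minor = torch.cuda.get_device_capability()
    if (major, minor) >= (8, 9):
        return torch.float8_e4m3fn  # Triton fp8e4nv: more mantissa bits
    return torch.float8_e5m2        # Triton fp8e5: available on SM 8.0/8.6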
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
  -> fails at moe/activation_test.py:117 in fn (eager path), compiling _fbgemm_silu_mul_quant

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
  -> fails at moe/activation_test.py:117 in fn (torch.compile path), compiling _fbgemm_silu_mul_quant:
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

(identical test body, source listing, and traceback repeated for each Hypothesis example below; only the drawn parameters and the failing kernel differ)

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
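Every example fails for the same reason: Triton cannot lower the fp8e4nv (float8 e4m3) dtype on this GPU. Triton accepts fp8e4nv only on CUDA compute capability 8.9 and newer (Ada/Hopper); on older architectures, such as SM 8.6 Ampere parts, it raises exactly this ValueError at kernel-compile time and offers only fp8e4b15 and fp8e5. A minimal capability probe, as a sketch (the helper name supports_fp8e4nv is made up here; torch.cuda.get_device_capability is the standard PyTorch API):

    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv (float8 e4m3) Triton kernels need SM 8.9+ on NVIDIA GPUs.
        # An SM 8.6 device makes this return False, matching the failures above.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)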
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)

For this example the log shows fn() completing and the failure moving to the reference path: y_fp8_ref, y_scale_ref = ref_fn() (moe/activation_test.py:126) calls triton_quantize_fp8_row (moe/activation_test.py:124), which reaches

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](

and fails while the Triton autotuner benchmarks its configs (triton/runtime/autotuner.py:186 -> triton/testing.py:117 -> triton/runtime/jit.py:623):

E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
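Both the kernel under test (_fbgemm_silu_mul_quant) and the reference quantizer (_kernel_quantize_fp8_row) emit fp8e4nv, so skipping has to happen above the whole test body, not just around silu_mul_quant. One way to express that, as a sketch only (the class name ActivationTest is invented; the log does not show the real TestCase name, and this is not necessarily how the repository fixed it):

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # Same probe as above: fp8e4nv needs SM 8.9+ on NVIDIA GPUs.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(
        not supports_fp8e4nv(),
        "fp8e4nv unsupported on this GPU; Triton offers only fp8e4b15/fp8e5",
    )
    class ActivationTest(unittest.TestCase):  # hypothetical name
        ...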
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)

Each of these fails at y_fp8, y_scale = fn() (moe/activation_test.py:117) while compiling _fbgemm_silu_mul_quant; the compiled=True runs route through torch/_dynamo/eval_frame.py:678 first, the compiled=False run calls activation.py:80 directly:

E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
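The failure is not specific to FBGEMM: any Triton kernel that casts to tl.float8e4nv on a pre-SM-8.9 GPU hits the same ValueError when the kernel is compiled. A minimal repro sketch (kernel name, tensor names, and sizes are all made up for illustration):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _cast_to_fp8e4nv(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(x_ptr + offs, mask=mask)
        # On SM < 8.9 this cast is what raises
        # ValueError("type fp8e4nv not supported in this architecture. ...")
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

    n = 1024
    x = torch.randn(n, device="cuda", dtype=torch.float32)
    y = torch.empty(n, device="cuda", dtype=torch.float8_e4m3fn)
    _cast_to_fp8e4nv[(triton.cdiv(n, 256),)](x, y, n, BLOCK=256)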
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:50.4996720Z 2025-05-07T20:33:50.4997150Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:50.4997155Z 2025-05-07T20:33:50.4997256Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.4997479Z self=, 2025-05-07T20:33:50.4997556Z T=1, 2025-05-07T20:33:50.4997637Z D=5120, 2025-05-07T20:33:50.4997717Z scale_ub=None, 2025-05-07T20:33:50.4997801Z contiguous=False, 2025-05-07T20:33:50.4997886Z compiled=False, 2025-05-07T20:33:50.4997959Z ) 2025-05-07T20:33:50.4998180Z self = 2025-05-07T20:33:50.4998352Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:50.4998356Z 2025-05-07T20:33:50.4998432Z @given( 2025-05-07T20:33:50.4998554Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.4998648Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.4998757Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.4998872Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.4998980Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.4999053Z ) 2025-05-07T20:33:50.4999350Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.4999446Z def test_silu_mul_quant( 2025-05-07T20:33:50.4999518Z self, 2025-05-07T20:33:50.4999598Z T: int, 2025-05-07T20:33:50.4999671Z D: int, 2025-05-07T20:33:50.4999796Z scale_ub: Optional[float], 2025-05-07T20:33:50.4999889Z contiguous: bool, 2025-05-07T20:33:50.4999990Z compiled: bool, 2025-05-07T20:33:50.5000068Z ) -> None: 2025-05-07T20:33:50.5000158Z torch.manual_seed(2025) 2025-05-07T20:33:50.5000228Z 2025-05-07T20:33:50.5000403Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.5000475Z 2025-05-07T20:33:50.5000561Z x_sign = torch.sign(x) 2025-05-07T20:33:50.5000688Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:50.5000776Z x = x_sign * x_clamp 2025-05-07T20:33:50.5000855Z x0 = x[:, :D] 2025-05-07T20:33:50.5000935Z x1 = x[:, D:] 2025-05-07T20:33:50.5001007Z 2025-05-07T20:33:50.5001134Z if contiguous: 2025-05-07T20:33:50.5001223Z x0 = x0.contiguous() 2025-05-07T20:33:50.5001311Z x1 = x1.contiguous() 2025-05-07T20:33:50.5001385Z 2025-05-07T20:33:50.5001473Z if scale_ub is not None: 2025-05-07T20:33:50.5001576Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:50.5001752Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:50.5001826Z ) 2025-05-07T20:33:50.5001903Z else: 2025-05-07T20:33:50.5001996Z scale_ub_tensor = None 2025-05-07T20:33:50.5002065Z 2025-05-07T20:33:50.5002194Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:50.5002284Z op = silu_mul_quant 2025-05-07T20:33:50.5002364Z if compiled: 2025-05-07T20:33:50.5002463Z op = torch.compile(op) 2025-05-07T20:33:50.5002564Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.5002636Z 2025-05-07T20:33:50.5002726Z > y_fp8, y_scale = fn() 2025-05-07T20:33:50.5002734Z 2025-05-07T20:33:50.5002832Z moe/activation_test.py:117: 2025-05-07T20:33:50.5002960Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.5003102Z moe/activation_test.py:115: in fn 2025-05-07T20:33:50.5003202Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.5003723Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:50.5003821Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:50.5004192Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:50.5004419Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:50.5004774Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:50.5004864Z kernel = self.compile( 2025-05-07T20:33:50.5005265Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:50.5005441Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:50.5005571Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.5005578Z 2025-05-07T20:33:50.5005782Z self = 2025-05-07T20:33:50.5006585Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:50.5007097Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1dc7de40>} 2025-05-07T20:33:50.5007927Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:50.5008129Z context = 2025-05-07T20:33:50.5008135Z 2025-05-07T20:33:50.5008298Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:50.5008569Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:50.5008672Z module_map=module_map) 2025-05-07T20:33:50.5008831Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:50.5008930Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:50.5009005Z E ^ 2025-05-07T20:33:50.5009368Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:50.5009375Z 2025-05-07T20:33:50.5009851Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:50.5009856Z 2025-05-07T20:33:50.5009956Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.5010206Z self=, 2025-05-07T20:33:50.5010373Z T=4096, 2025-05-07T20:33:50.5010465Z D=7168, 2025-05-07T20:33:50.5010550Z scale_ub=1200.0, 2025-05-07T20:33:50.5010634Z contiguous=False, 2025-05-07T20:33:50.5010716Z compiled=False, 2025-05-07T20:33:50.5010795Z ) 2025-05-07T20:33:50.5011017Z self = 2025-05-07T20:33:50.5011201Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:50.5011205Z 2025-05-07T20:33:50.5011286Z @given( 2025-05-07T20:33:50.5011404Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.5011508Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.5011630Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.5011745Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.5011902Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.5011979Z ) 2025-05-07T20:33:50.5012233Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.5012327Z def test_silu_mul_quant( 2025-05-07T20:33:50.5012405Z self, 2025-05-07T20:33:50.5012484Z T: int, 2025-05-07T20:33:50.5012560Z D: int, 2025-05-07T20:33:50.5012654Z scale_ub: Optional[float], 2025-05-07T20:33:50.5012740Z contiguous: bool, 2025-05-07T20:33:50.5012824Z compiled: bool, 2025-05-07T20:33:50.5012896Z ) -> None: 2025-05-07T20:33:50.5012986Z torch.manual_seed(2025) 2025-05-07T20:33:50.5013060Z 2025-05-07T20:33:50.5013230Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.5013308Z 2025-05-07T20:33:50.5013398Z x_sign = torch.sign(x) 2025-05-07T20:33:50.5013518Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:50.5013615Z x = x_sign * x_clamp 2025-05-07T20:33:50.5013696Z x0 = x[:, :D] 2025-05-07T20:33:50.5013775Z x1 = x[:, D:] 2025-05-07T20:33:50.5013853Z 2025-05-07T20:33:50.5013934Z if contiguous: 2025-05-07T20:33:50.5014027Z x0 = x0.contiguous() 2025-05-07T20:33:50.5014120Z x1 = x1.contiguous() 2025-05-07T20:33:50.5014192Z 2025-05-07T20:33:50.5014282Z if scale_ub is not None: 2025-05-07T20:33:50.5014385Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:50.5014587Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:50.5014663Z ) 2025-05-07T20:33:50.5014737Z else: 2025-05-07T20:33:50.5014828Z scale_ub_tensor = None 2025-05-07T20:33:50.5014953Z 2025-05-07T20:33:50.5015079Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:50.5015172Z op = silu_mul_quant 2025-05-07T20:33:50.5015258Z if compiled: 2025-05-07T20:33:50.5015353Z op = torch.compile(op) 2025-05-07T20:33:50.5015456Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.5015531Z 2025-05-07T20:33:50.5015618Z > y_fp8, y_scale = fn() 2025-05-07T20:33:50.5015622Z 2025-05-07T20:33:50.5015721Z moe/activation_test.py:117: 2025-05-07T20:33:50.5015849Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.5015946Z moe/activation_test.py:115: in fn 2025-05-07T20:33:50.5016046Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.5016567Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:50.5016662Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:50.5017081Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:50.5017308Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:50.5017674Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:50.5017805Z kernel = self.compile( 2025-05-07T20:33:50.5018202Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:50.5018380Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:50.5018506Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.5018511Z 2025-05-07T20:33:50.5018714Z self = 2025-05-07T20:33:50.5019526Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:50.5020080Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1dc7f380>} 2025-05-07T20:33:50.5020872Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:50.5021063Z context = 2025-05-07T20:33:50.5021067Z 2025-05-07T20:33:50.5021245Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:50.5021516Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:50.5021624Z module_map=module_map) 2025-05-07T20:33:50.5021794Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:50.5021894Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:50.5021970Z E ^ 2025-05-07T20:33:50.5022346Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:50.5022353Z 2025-05-07T20:33:50.5022785Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:50.5022789Z 2025-05-07T20:33:50.5022893Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.5023116Z self=, 2025-05-07T20:33:50.5023190Z T=16384, 2025-05-07T20:33:50.5023268Z D=7168, 2025-05-07T20:33:50.5023346Z scale_ub=None, 2025-05-07T20:33:50.5023431Z contiguous=True, 2025-05-07T20:33:50.5023511Z compiled=True, 2025-05-07T20:33:50.5023581Z ) 2025-05-07T20:33:50.5023849Z self = 2025-05-07T20:33:50.5024025Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:50.5024030Z 2025-05-07T20:33:50.5024106Z @given( 2025-05-07T20:33:50.5024230Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.5024328Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.5024439Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.5024557Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.5024667Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.5024738Z ) 2025-05-07T20:33:50.5024988Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.5025078Z def test_silu_mul_quant( 2025-05-07T20:33:50.5025159Z self, 2025-05-07T20:33:50.5025232Z T: int, 2025-05-07T20:33:50.5025307Z D: int, 2025-05-07T20:33:50.5025603Z scale_ub: Optional[float], 2025-05-07T20:33:50.5025738Z contiguous: bool, 2025-05-07T20:33:50.5025950Z compiled: bool, 2025-05-07T20:33:50.5026034Z ) -> None: 2025-05-07T20:33:50.5026124Z torch.manual_seed(2025) 2025-05-07T20:33:50.5026199Z 2025-05-07T20:33:50.5026378Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.5026511Z 2025-05-07T20:33:50.5026599Z x_sign = torch.sign(x) 2025-05-07T20:33:50.5026725Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:50.5026811Z x = x_sign * x_clamp 2025-05-07T20:33:50.5026892Z x0 = x[:, :D] 2025-05-07T20:33:50.5026971Z x1 = x[:, D:] 2025-05-07T20:33:50.5027044Z 2025-05-07T20:33:50.5027129Z if contiguous: 2025-05-07T20:33:50.5027220Z x0 = x0.contiguous() 2025-05-07T20:33:50.5027306Z x1 = x1.contiguous() 2025-05-07T20:33:50.5027379Z 2025-05-07T20:33:50.5027469Z if scale_ub is not None: 2025-05-07T20:33:50.5027573Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:50.5027712Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:50.5027784Z ) 2025-05-07T20:33:50.5027858Z else: 2025-05-07T20:33:50.5028015Z scale_ub_tensor = None 2025-05-07T20:33:50.5028089Z 2025-05-07T20:33:50.5028220Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:50.5028318Z op = silu_mul_quant 2025-05-07T20:33:50.5028404Z if compiled: 2025-05-07T20:33:50.5028509Z op = torch.compile(op) 2025-05-07T20:33:50.5028614Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.5028689Z 2025-05-07T20:33:50.5028783Z > y_fp8, y_scale = fn() 2025-05-07T20:33:50.5028788Z 2025-05-07T20:33:50.5028885Z moe/activation_test.py:117: 2025-05-07T20:33:50.5029015Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.5029118Z moe/activation_test.py:115: in fn 2025-05-07T20:33:50.5029220Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.5029609Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:50.5029701Z return fn(*args, **kwargs) 
2025-05-07T20:33:50.5030219Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:50.5030322Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:50.5030694Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:50.5030917Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:50.5031277Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:50.5031368Z kernel = self.compile( 2025-05-07T20:33:50.5031766Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:50.5032008Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:50.5032142Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.5032146Z 2025-05-07T20:33:50.5032356Z self = 2025-05-07T20:33:50.5033168Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:50.5033680Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1d6b44a0>} 2025-05-07T20:33:50.5034508Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:50.5034702Z context = 2025-05-07T20:33:50.5034712Z 2025-05-07T20:33:50.5034880Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:50.5035192Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:50.5035300Z module_map=module_map) 2025-05-07T20:33:50.5035460Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:50.5035558Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:50.5035635Z E ^ 2025-05-07T20:33:50.5036002Z E ValueError("type fp8e4nv not supported in this architecture. 
Trying example: test_silu_mul_quant(
    self=<...>,
    T=4096,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self = <...>
T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f1b1d6b51c0>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
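Note that the check fires inside ast_to_ttir, i.e. at kernel compile time before anything launches, which is why the compiled=True and compiled=False examples fail identically. A self-contained sketch that reproduces the same ValueError on a pre-SM-8.9 GPU; the kernel name and shapes are illustrative, and it assumes a recent Triton and a PyTorch build with float8 dtypes:

    # Sketch: casting to tl.float8e4nv inside any jitted kernel trips the
    # same architecture check at compile time on GPUs without fp8e4nv.
    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def cast_to_fp8e4nv(x_ptr, y_ptr, N, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < N
        x = tl.load(x_ptr + offs, mask=mask)
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

    x = torch.randn(128, device="cuda", dtype=torch.bfloat16)
    y = torch.empty(128, device="cuda", dtype=torch.float8_e4m3fn)
    cast_to_fp8e4nv[(1,)](x, y, 128, BLOCK=128)  # CompilationError on SM < 8.9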
Hypothesis went on to try the remaining examples; every one failed at the same _fbgemm_silu_mul_quant compile with the identical CompilationError from triton/compiler/compiler.py:100:

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> CompilationError: type fp8e4nv not supported in this architecture
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False) -> same CompilationError
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False) -> same CompilationError
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True) -> same CompilationError
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True) -> same CompilationError
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True) -> same CompilationError
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:50.5199530Z 2025-05-07T20:33:50.5199967Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:50.5199971Z 2025-05-07T20:33:50.5200070Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.5200330Z self=, 2025-05-07T20:33:50.5200420Z T=2048, 2025-05-07T20:33:50.5200497Z D=5120, 2025-05-07T20:33:50.5200580Z scale_ub=None, 2025-05-07T20:33:50.5200664Z contiguous=False, 2025-05-07T20:33:50.5200743Z compiled=True, 2025-05-07T20:33:50.5200814Z ) 2025-05-07T20:33:50.5201042Z self = 2025-05-07T20:33:50.5201290Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:50.5201294Z 2025-05-07T20:33:50.5201373Z @given( 2025-05-07T20:33:50.5201491Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.5201589Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.5201704Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.5201819Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.5201929Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.5202008Z ) 2025-05-07T20:33:50.5202255Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.5202347Z def test_silu_mul_quant( 2025-05-07T20:33:50.5202423Z self, 2025-05-07T20:33:50.5202499Z T: int, 2025-05-07T20:33:50.5202577Z D: int, 2025-05-07T20:33:50.5202671Z scale_ub: Optional[float], 2025-05-07T20:33:50.5202757Z contiguous: bool, 2025-05-07T20:33:50.5202842Z compiled: bool, 2025-05-07T20:33:50.5202961Z ) -> None: 2025-05-07T20:33:50.5203057Z torch.manual_seed(2025) 2025-05-07T20:33:50.5203133Z 2025-05-07T20:33:50.5203305Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.5203418Z 2025-05-07T20:33:50.5203509Z x_sign = torch.sign(x) 2025-05-07T20:33:50.5203632Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:50.5203723Z x = x_sign * x_clamp 2025-05-07T20:33:50.5203806Z x0 = x[:, :D] 2025-05-07T20:33:50.5203888Z x1 = x[:, D:] 2025-05-07T20:33:50.5203963Z 2025-05-07T20:33:50.5204052Z if contiguous: 2025-05-07T20:33:50.5204147Z x0 = x0.contiguous() 2025-05-07T20:33:50.5204245Z x1 = x1.contiguous() 2025-05-07T20:33:50.5204318Z 2025-05-07T20:33:50.5204409Z if scale_ub is not None: 2025-05-07T20:33:50.5204521Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:50.5204664Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:50.5204743Z ) 2025-05-07T20:33:50.5204831Z else: 2025-05-07T20:33:50.5204930Z scale_ub_tensor = None 2025-05-07T20:33:50.5205053Z 2025-05-07T20:33:50.5205190Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:50.5205290Z op = silu_mul_quant 2025-05-07T20:33:50.5205381Z if compiled: 2025-05-07T20:33:50.5205489Z op = torch.compile(op) 2025-05-07T20:33:50.5205598Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.5205679Z 2025-05-07T20:33:50.5205770Z > y_fp8, y_scale = fn() 2025-05-07T20:33:50.5205774Z 2025-05-07T20:33:50.5205869Z moe/activation_test.py:117: 2025-05-07T20:33:50.5206003Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.5206103Z moe/activation_test.py:115: in fn 2025-05-07T20:33:50.5206207Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.5206593Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:50.5206683Z return fn(*args, **kwargs) 
2025-05-07T20:33:50.5207202Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:50.5207300Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:50.5207669Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:50.5207897Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:50.5208250Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:50.5208340Z kernel = self.compile( 2025-05-07T20:33:50.5208741Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:50.5208964Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:50.5209094Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.5209099Z 2025-05-07T20:33:50.5209307Z self = 2025-05-07T20:33:50.5210168Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:50.5210680Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1d1a19e0>} 2025-05-07T20:33:50.5211462Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:50.5211704Z context = 2025-05-07T20:33:50.5211709Z 2025-05-07T20:33:50.5211878Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:50.5212152Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:50.5212297Z module_map=module_map) 2025-05-07T20:33:50.5212453Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:50.5212552Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:50.5212630Z E ^ 2025-05-07T20:33:50.5212995Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:50.5213000Z 2025-05-07T20:33:50.5213438Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:50.5213445Z 2025-05-07T20:33:50.5213548Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.5213777Z self=, 2025-05-07T20:33:50.5213852Z T=2048, 2025-05-07T20:33:50.5213966Z D=5120, 2025-05-07T20:33:50.5214058Z scale_ub=1200.0, 2025-05-07T20:33:50.5214149Z contiguous=False, 2025-05-07T20:33:50.5214231Z compiled=True, 2025-05-07T20:33:50.5214311Z ) 2025-05-07T20:33:50.5214615Z self = 2025-05-07T20:33:50.5214792Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:50.5214796Z 2025-05-07T20:33:50.5214879Z @given( 2025-05-07T20:33:50.5214997Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.5215100Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.5215211Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.5215326Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.5215447Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.5215517Z ) 2025-05-07T20:33:50.5215766Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.5215865Z def test_silu_mul_quant( 2025-05-07T20:33:50.5215943Z self, 2025-05-07T20:33:50.5216020Z T: int, 2025-05-07T20:33:50.5216096Z D: int, 2025-05-07T20:33:50.5216191Z scale_ub: Optional[float], 2025-05-07T20:33:50.5216282Z contiguous: bool, 2025-05-07T20:33:50.5216363Z compiled: bool, 2025-05-07T20:33:50.5216437Z ) -> None: 2025-05-07T20:33:50.5216532Z torch.manual_seed(2025) 2025-05-07T20:33:50.5216601Z 2025-05-07T20:33:50.5216772Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.5216844Z 2025-05-07T20:33:50.5216933Z x_sign = torch.sign(x) 2025-05-07T20:33:50.5217059Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:50.5217197Z x = x_sign * x_clamp 2025-05-07T20:33:50.5217276Z x0 = x[:, :D] 2025-05-07T20:33:50.5217353Z x1 = x[:, D:] 2025-05-07T20:33:50.5217428Z 2025-05-07T20:33:50.5217508Z if contiguous: 2025-05-07T20:33:50.5217599Z x0 = x0.contiguous() 2025-05-07T20:33:50.5217692Z x1 = x1.contiguous() 2025-05-07T20:33:50.5217762Z 2025-05-07T20:33:50.5217855Z if scale_ub is not None: 2025-05-07T20:33:50.5217958Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:50.5218090Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:50.5218167Z ) 2025-05-07T20:33:50.5218243Z else: 2025-05-07T20:33:50.5218337Z scale_ub_tensor = None 2025-05-07T20:33:50.5218409Z 2025-05-07T20:33:50.5218535Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:50.5218620Z op = silu_mul_quant 2025-05-07T20:33:50.5218705Z if compiled: 2025-05-07T20:33:50.5218802Z op = torch.compile(op) 2025-05-07T20:33:50.5218947Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.5219018Z 2025-05-07T20:33:50.5219105Z > y_fp8, y_scale = fn() 2025-05-07T20:33:50.5219112Z 2025-05-07T20:33:50.5219208Z moe/activation_test.py:117: 2025-05-07T20:33:50.5219373Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.5219471Z moe/activation_test.py:115: in fn 2025-05-07T20:33:50.5219573Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.5219955Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:50.5220048Z return fn(*args, **kwargs) 
2025-05-07T20:33:50.5220564Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:50.5220662Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:50.5221043Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:50.5221267Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:50.5221658Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:50.5221757Z kernel = self.compile( 2025-05-07T20:33:50.5222158Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:50.5222329Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:50.5222461Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.5222466Z 2025-05-07T20:33:50.5222670Z self = 2025-05-07T20:33:50.5228322Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:50.5228885Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1d1a2b60>} 2025-05-07T20:33:50.5229685Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:50.5229885Z context = 2025-05-07T20:33:50.5229890Z 2025-05-07T20:33:50.5230058Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:50.5230330Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:50.5230554Z module_map=module_map) 2025-05-07T20:33:50.5230723Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:50.5230828Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:50.5230906Z E ^ 2025-05-07T20:33:50.5231279Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:50.5231287Z 2025-05-07T20:33:50.5231724Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:50.5231728Z 2025-05-07T20:33:50.5231835Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.5232073Z self=, 2025-05-07T20:33:50.5232151Z T=4096, 2025-05-07T20:33:50.5232226Z D=5120, 2025-05-07T20:33:50.5232311Z scale_ub=1200.0, 2025-05-07T20:33:50.5232395Z contiguous=True, 2025-05-07T20:33:50.5232475Z compiled=True, 2025-05-07T20:33:50.5232553Z ) 2025-05-07T20:33:50.5232842Z self = 2025-05-07T20:33:50.5233018Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:50.5233022Z 2025-05-07T20:33:50.5233101Z @given( 2025-05-07T20:33:50.5233222Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.5233404Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.5233521Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.5233645Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.5233767Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.5233846Z ) 2025-05-07T20:33:50.5234105Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.5234204Z def test_silu_mul_quant( 2025-05-07T20:33:50.5234285Z self, 2025-05-07T20:33:50.5234369Z T: int, 2025-05-07T20:33:50.5234459Z D: int, 2025-05-07T20:33:50.5234563Z scale_ub: Optional[float], 2025-05-07T20:33:50.5234659Z contiguous: bool, 2025-05-07T20:33:50.5234752Z compiled: bool, 2025-05-07T20:33:50.5234828Z ) -> None: 2025-05-07T20:33:50.5234928Z torch.manual_seed(2025) 2025-05-07T20:33:50.5235092Z 2025-05-07T20:33:50.5235266Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.5235349Z 2025-05-07T20:33:50.5235443Z x_sign = torch.sign(x) 2025-05-07T20:33:50.5235574Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:50.5235667Z x = x_sign * x_clamp 2025-05-07T20:33:50.5235747Z x0 = x[:, :D] 2025-05-07T20:33:50.5235827Z x1 = x[:, D:] 2025-05-07T20:33:50.5235905Z 2025-05-07T20:33:50.5235993Z if contiguous: 2025-05-07T20:33:50.5236087Z x0 = x0.contiguous() 2025-05-07T20:33:50.5236185Z x1 = x1.contiguous() 2025-05-07T20:33:50.5236261Z 2025-05-07T20:33:50.5236360Z if scale_ub is not None: 2025-05-07T20:33:50.5236470Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:50.5236613Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:50.5236697Z ) 2025-05-07T20:33:50.5236779Z else: 2025-05-07T20:33:50.5236876Z scale_ub_tensor = None 2025-05-07T20:33:50.5236956Z 2025-05-07T20:33:50.5237087Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:50.5237182Z op = silu_mul_quant 2025-05-07T20:33:50.5237279Z if compiled: 2025-05-07T20:33:50.5237386Z op = torch.compile(op) 2025-05-07T20:33:50.5237494Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.5237578Z 2025-05-07T20:33:50.5237671Z > y_fp8, y_scale = fn() 2025-05-07T20:33:50.5237676Z 2025-05-07T20:33:50.5237785Z moe/activation_test.py:117: 2025-05-07T20:33:50.5237920Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.5238072Z moe/activation_test.py:115: in fn 2025-05-07T20:33:50.5238181Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.5238566Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:50.5238663Z return fn(*args, **kwargs) 
2025-05-07T20:33:50.5239189Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:50.5239292Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:50.5239668Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:50.5239896Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:50.5240251Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:50.5240347Z kernel = self.compile( 2025-05-07T20:33:50.5240795Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:50.5240977Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:50.5241111Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.5241116Z 2025-05-07T20:33:50.5241369Z self = 2025-05-07T20:33:50.5242181Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:50.5242696Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1d054180>} 2025-05-07T20:33:50.5244893Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:50.5245090Z context = 2025-05-07T20:33:50.5245094Z 2025-05-07T20:33:50.5245300Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:50.5245579Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:50.5245687Z module_map=module_map) 2025-05-07T20:33:50.5245850Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:50.5245953Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:50.5246028Z E ^ 2025-05-07T20:33:50.5246406Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:50.5246411Z 2025-05-07T20:33:50.5246843Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:50.5246852Z 2025-05-07T20:33:50.5246962Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.5247198Z self=, 2025-05-07T20:33:50.5247274Z T=128, 2025-05-07T20:33:50.5247357Z D=5120, 2025-05-07T20:33:50.5247441Z scale_ub=1200.0, 2025-05-07T20:33:50.5247526Z contiguous=False, 2025-05-07T20:33:50.5247610Z compiled=True, 2025-05-07T20:33:50.5247687Z ) 2025-05-07T20:33:50.5247911Z self = 2025-05-07T20:33:50.5248090Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:50.5248094Z 2025-05-07T20:33:50.5248172Z @given( 2025-05-07T20:33:50.5248289Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.5248393Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.5248506Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.5248690Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.5248803Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.5248878Z ) 2025-05-07T20:33:50.5249137Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.5249232Z def test_silu_mul_quant( 2025-05-07T20:33:50.5249310Z self, 2025-05-07T20:33:50.5249391Z T: int, 2025-05-07T20:33:50.5249479Z D: int, 2025-05-07T20:33:50.5249595Z scale_ub: Optional[float], 2025-05-07T20:33:50.5249701Z contiguous: bool, 2025-05-07T20:33:50.5249797Z compiled: bool, 2025-05-07T20:33:50.5249874Z ) -> None: 2025-05-07T20:33:50.5249975Z torch.manual_seed(2025) 2025-05-07T20:33:50.5250047Z 2025-05-07T20:33:50.5250220Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.5250296Z 2025-05-07T20:33:50.5250394Z x_sign = torch.sign(x) 2025-05-07T20:33:50.5250568Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:50.5250661Z x = x_sign * x_clamp 2025-05-07T20:33:50.5250743Z x0 = x[:, :D] 2025-05-07T20:33:50.5250828Z x1 = x[:, D:] 2025-05-07T20:33:50.5250905Z 2025-05-07T20:33:50.5250988Z if contiguous: 2025-05-07T20:33:50.5251134Z x0 = x0.contiguous() 2025-05-07T20:33:50.5251226Z x1 = x1.contiguous() 2025-05-07T20:33:50.5251301Z 2025-05-07T20:33:50.5251394Z if scale_ub is not None: 2025-05-07T20:33:50.5251500Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:50.5251635Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:50.5251721Z ) 2025-05-07T20:33:50.5251798Z else: 2025-05-07T20:33:50.5251898Z scale_ub_tensor = None 2025-05-07T20:33:50.5251974Z 2025-05-07T20:33:50.5252107Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:50.5252213Z op = silu_mul_quant 2025-05-07T20:33:50.5252302Z if compiled: 2025-05-07T20:33:50.5252412Z op = torch.compile(op) 2025-05-07T20:33:50.5252527Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.5252607Z 2025-05-07T20:33:50.5252745Z > y_fp8, y_scale = fn() 2025-05-07T20:33:50.5252752Z 2025-05-07T20:33:50.5252856Z moe/activation_test.py:117: 2025-05-07T20:33:50.5252991Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.5253100Z moe/activation_test.py:115: in fn 2025-05-07T20:33:50.5253202Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.5253590Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:50.5253691Z return fn(*args, **kwargs) 
2025-05-07T20:33:50.5254216Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:50.5254318Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:50.5254777Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:50.5255008Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:50.5255368Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:50.5255463Z kernel = self.compile( 2025-05-07T20:33:50.5255864Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:50.5256041Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:50.5256173Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.5256178Z 2025-05-07T20:33:50.5256394Z self = 2025-05-07T20:33:50.5257257Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:50.5257779Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1d054ea0>} 2025-05-07T20:33:50.5258572Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:50.5258766Z context = 2025-05-07T20:33:50.5258771Z 2025-05-07T20:33:50.5258943Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:50.5259216Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:50.5259365Z module_map=module_map) 2025-05-07T20:33:50.5259533Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:50.5259635Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:50.5259720Z E ^ 2025-05-07T20:33:50.5260136Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:50.5260191Z 2025-05-07T20:33:50.5260629Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:50.5260634Z 2025-05-07T20:33:50.5260744Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.5260970Z self=, 2025-05-07T20:33:50.5261048Z T=16384, 2025-05-07T20:33:50.5261132Z D=7168, 2025-05-07T20:33:50.5261215Z scale_ub=1200.0, 2025-05-07T20:33:50.5261301Z contiguous=True, 2025-05-07T20:33:50.5261387Z compiled=True, 2025-05-07T20:33:50.5261464Z ) 2025-05-07T20:33:50.5261689Z self = 2025-05-07T20:33:50.5261867Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:50.5261914Z 2025-05-07T20:33:50.5261999Z @given( 2025-05-07T20:33:50.5262129Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.5262231Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.5262349Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.5262471Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.5262586Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.5262669Z ) 2025-05-07T20:33:50.5262920Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.5263019Z def test_silu_mul_quant( 2025-05-07T20:33:50.5263101Z self, 2025-05-07T20:33:50.5263184Z T: int, 2025-05-07T20:33:50.5263263Z D: int, 2025-05-07T20:33:50.5263373Z scale_ub: Optional[float], 2025-05-07T20:33:50.5263465Z contiguous: bool, 2025-05-07T20:33:50.5263552Z compiled: bool, 2025-05-07T20:33:50.5263633Z ) -> None: 2025-05-07T20:33:50.5263731Z torch.manual_seed(2025) 2025-05-07T20:33:50.5263810Z 2025-05-07T20:33:50.5263986Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.5264061Z 2025-05-07T20:33:50.5264158Z x_sign = torch.sign(x) 2025-05-07T20:33:50.5264286Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:50.5264383Z x = x_sign * x_clamp 2025-05-07T20:33:50.5264474Z x0 = x[:, :D] 2025-05-07T20:33:50.5264561Z x1 = x[:, D:] 2025-05-07T20:33:50.5264641Z 2025-05-07T20:33:50.5264735Z if contiguous: 2025-05-07T20:33:50.5264832Z x0 = x0.contiguous() 2025-05-07T20:33:50.5264929Z x1 = x1.contiguous() 2025-05-07T20:33:50.5265066Z 2025-05-07T20:33:50.5265162Z if scale_ub is not None: 2025-05-07T20:33:50.5265276Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:50.5265417Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:50.5265498Z ) 2025-05-07T20:33:50.5265576Z else: 2025-05-07T20:33:50.5265679Z scale_ub_tensor = None 2025-05-07T20:33:50.5265757Z 2025-05-07T20:33:50.5265891Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:50.5265986Z op = silu_mul_quant 2025-05-07T20:33:50.5266074Z if compiled: 2025-05-07T20:33:50.5266179Z op = torch.compile(op) 2025-05-07T20:33:50.5266287Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.5266362Z 2025-05-07T20:33:50.5266467Z > y_fp8, y_scale = fn() 2025-05-07T20:33:50.5266472Z 2025-05-07T20:33:50.5266572Z moe/activation_test.py:117: 2025-05-07T20:33:50.5266704Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.5266858Z moe/activation_test.py:115: in fn 2025-05-07T20:33:50.5266963Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.5267358Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:50.5267538Z return fn(*args, **kwargs) 
2025-05-07T20:33:50.5268055Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:50.5268155Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:50.5268526Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:50.5268753Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:50.5269112Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:50.5269207Z kernel = self.compile( 2025-05-07T20:33:50.5269614Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:50.5269792Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:50.5269961Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.5269972Z 2025-05-07T20:33:50.5270182Z self = 2025-05-07T20:33:50.5270992Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:50.5271510Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1d0560c0>} 2025-05-07T20:33:50.5272306Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:50.5272509Z context = 2025-05-07T20:33:50.5272514Z 2025-05-07T20:33:50.5272683Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:50.5272953Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:50.5273060Z module_map=module_map) 2025-05-07T20:33:50.5273221Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:50.5273320Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:50.5273404Z E ^ 2025-05-07T20:33:50.5273778Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:50.5273826Z 2025-05-07T20:33:50.5274269Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:50.5274274Z 2025-05-07T20:33:50.5274380Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.5274608Z self=, 2025-05-07T20:33:50.5274693Z T=16384, 2025-05-07T20:33:50.5274772Z D=5120, 2025-05-07T20:33:50.5274855Z scale_ub=1200.0, 2025-05-07T20:33:50.5274948Z contiguous=True, 2025-05-07T20:33:50.5275033Z compiled=False, 2025-05-07T20:33:50.5275104Z ) 2025-05-07T20:33:50.5275334Z self = 2025-05-07T20:33:50.5275516Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:50.5275521Z 2025-05-07T20:33:50.5275606Z @given( 2025-05-07T20:33:50.5275728Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.5275834Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.5275996Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.5276117Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.5276240Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.5276326Z ) 2025-05-07T20:33:50.5276576Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.5276715Z def test_silu_mul_quant( 2025-05-07T20:33:50.5276790Z self, 2025-05-07T20:33:50.5276866Z T: int, 2025-05-07T20:33:50.5276947Z D: int, 2025-05-07T20:33:50.5277046Z scale_ub: Optional[float], 2025-05-07T20:33:50.5277139Z contiguous: bool, 2025-05-07T20:33:50.5277234Z compiled: bool, 2025-05-07T20:33:50.5277312Z ) -> None: 2025-05-07T20:33:50.5277410Z torch.manual_seed(2025) 2025-05-07T20:33:50.5277488Z 2025-05-07T20:33:50.5277661Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.5277744Z 2025-05-07T20:33:50.5277838Z x_sign = torch.sign(x) 2025-05-07T20:33:50.5277970Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:50.5278066Z x = x_sign * x_clamp 2025-05-07T20:33:50.5278188Z x0 = x[:, :D] 2025-05-07T20:33:50.5278269Z x1 = x[:, D:] 2025-05-07T20:33:50.5278362Z 2025-05-07T20:33:50.5278448Z if contiguous: 2025-05-07T20:33:50.5278542Z x0 = x0.contiguous() 2025-05-07T20:33:50.5278642Z x1 = x1.contiguous() 2025-05-07T20:33:50.5278717Z 2025-05-07T20:33:50.5278813Z if scale_ub is not None: 2025-05-07T20:33:50.5278926Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:50.5279063Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:50.5279143Z ) 2025-05-07T20:33:50.5279229Z else: 2025-05-07T20:33:50.5279328Z scale_ub_tensor = None 2025-05-07T20:33:50.5279403Z 2025-05-07T20:33:50.5279537Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:50.5279632Z op = silu_mul_quant 2025-05-07T20:33:50.5279720Z if compiled: 2025-05-07T20:33:50.5279825Z op = torch.compile(op) 2025-05-07T20:33:50.5279934Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.5280028Z 2025-05-07T20:33:50.5280143Z > y_fp8, y_scale = fn() 2025-05-07T20:33:50.5280149Z 2025-05-07T20:33:50.5280257Z moe/activation_test.py:117: 2025-05-07T20:33:50.5280413Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.5280513Z moe/activation_test.py:115: in fn 2025-05-07T20:33:50.5280610Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.5281142Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:50.5281241Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:50.5281624Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:50.5281898Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:50.5282255Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:50.5282359Z kernel = self.compile( 2025-05-07T20:33:50.5282759Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:50.5282939Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:50.5283067Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.5283071Z 2025-05-07T20:33:50.5283278Z self = 2025-05-07T20:33:50.5284126Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:50.5284649Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1d055a80>} 2025-05-07T20:33:50.5285489Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:50.5285683Z context = 2025-05-07T20:33:50.5285688Z 2025-05-07T20:33:50.5285857Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:50.5286137Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:50.5286246Z module_map=module_map) 2025-05-07T20:33:50.5286420Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:50.5286519Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:50.5286595Z E ^ 2025-05-07T20:33:50.5287012Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:50.5287019Z 2025-05-07T20:33:50.5287453Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:50.5287457Z 2025-05-07T20:33:50.5287563Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.5287794Z self=, 2025-05-07T20:33:50.5287877Z T=1, 2025-05-07T20:33:50.5287960Z D=7168, 2025-05-07T20:33:50.5288045Z scale_ub=1200.0, 2025-05-07T20:33:50.5288134Z contiguous=False, 2025-05-07T20:33:50.5288232Z compiled=False, 2025-05-07T20:33:50.5288311Z ) 2025-05-07T20:33:50.5288541Z self = 2025-05-07T20:33:50.5288723Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:50.5288727Z 2025-05-07T20:33:50.5288814Z @given( 2025-05-07T20:33:50.5288943Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.5289044Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.5289167Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.5289288Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.5289404Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.5289477Z ) 2025-05-07T20:33:50.5289759Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.5289871Z def test_silu_mul_quant( 2025-05-07T20:33:50.5289958Z self, 2025-05-07T20:33:50.5290041Z T: int, 2025-05-07T20:33:50.5290123Z D: int, 2025-05-07T20:33:50.5290221Z scale_ub: Optional[float], 2025-05-07T20:33:50.5290376Z contiguous: bool, 2025-05-07T20:33:50.5290466Z compiled: bool, 2025-05-07T20:33:50.5290549Z ) -> None: 2025-05-07T20:33:50.5290644Z torch.manual_seed(2025) 2025-05-07T20:33:50.5290725Z 2025-05-07T20:33:50.5290897Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.5290972Z 2025-05-07T20:33:50.5291071Z x_sign = torch.sign(x) 2025-05-07T20:33:50.5291194Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:50.5291286Z x = x_sign * x_clamp 2025-05-07T20:33:50.5291376Z x0 = x[:, :D] 2025-05-07T20:33:50.5291457Z x1 = x[:, D:] 2025-05-07T20:33:50.5291532Z 2025-05-07T20:33:50.5291627Z if contiguous: 2025-05-07T20:33:50.5291719Z x0 = x0.contiguous() 2025-05-07T20:33:50.5291814Z x1 = x1.contiguous() 2025-05-07T20:33:50.5291890Z 2025-05-07T20:33:50.5291982Z if scale_ub is not None: 2025-05-07T20:33:50.5292097Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:50.5292279Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:50.5292359Z ) 2025-05-07T20:33:50.5292444Z else: 2025-05-07T20:33:50.5292543Z scale_ub_tensor = None 2025-05-07T20:33:50.5292622Z 2025-05-07T20:33:50.5292755Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:50.5292891Z op = silu_mul_quant 2025-05-07T20:33:50.5292979Z if compiled: 2025-05-07T20:33:50.5293086Z op = torch.compile(op) 2025-05-07T20:33:50.5293193Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.5293274Z 2025-05-07T20:33:50.5293367Z > y_fp8, y_scale = fn() 2025-05-07T20:33:50.5293371Z 2025-05-07T20:33:50.5293471Z moe/activation_test.py:117: 2025-05-07T20:33:50.5293606Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.5293710Z moe/activation_test.py:115: in fn 2025-05-07T20:33:50.5293819Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.5294354Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:50.5294578Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:50.5294957Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:50.5295193Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:50.5295551Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:50.5295650Z kernel = self.compile( 2025-05-07T20:33:50.5296050Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:50.5296227Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:50.5296362Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.5296370Z 2025-05-07T20:33:50.5296582Z self = 2025-05-07T20:33:50.5297397Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:50.5297914Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1cdfc0e0>} 2025-05-07T20:33:50.5298707Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:50.5298901Z context = 2025-05-07T20:33:50.5298952Z 2025-05-07T20:33:50.5299125Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:50.5299406Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:50.5299524Z module_map=module_map) 2025-05-07T20:33:50.5299692Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:50.5299800Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:50.5299880Z E ^ 2025-05-07T20:33:50.5300253Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:50.5300257Z 2025-05-07T20:33:50.5300688Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:50.5300692Z 2025-05-07T20:33:50.5300794Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.5301026Z self=, 2025-05-07T20:33:50.5301105Z T=4096, 2025-05-07T20:33:50.5301255Z D=7168, 2025-05-07T20:33:50.5301341Z scale_ub=1200.0, 2025-05-07T20:33:50.5301431Z contiguous=False, 2025-05-07T20:33:50.5301519Z compiled=True, 2025-05-07T20:33:50.5301596Z ) 2025-05-07T20:33:50.5301824Z self = 2025-05-07T20:33:50.5302050Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:50.5302054Z 2025-05-07T20:33:50.5302134Z @given( 2025-05-07T20:33:50.5302252Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.5302358Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.5302478Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.5302606Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.5302723Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.5302803Z ) 2025-05-07T20:33:50.5303069Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.5303171Z def test_silu_mul_quant( 2025-05-07T20:33:50.5303254Z self, 2025-05-07T20:33:50.5303337Z T: int, 2025-05-07T20:33:50.5303415Z D: int, 2025-05-07T20:33:50.5303558Z scale_ub: Optional[float], 2025-05-07T20:33:50.5303653Z contiguous: bool, 2025-05-07T20:33:50.5303738Z compiled: bool, 2025-05-07T20:33:50.5303816Z ) -> None: 2025-05-07T20:33:50.5303918Z torch.manual_seed(2025) 2025-05-07T20:33:50.5303989Z 2025-05-07T20:33:50.5304159Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.5304235Z 2025-05-07T20:33:50.5304327Z x_sign = torch.sign(x) 2025-05-07T20:33:50.5304452Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:50.5304540Z x = x_sign * x_clamp 2025-05-07T20:33:50.5304620Z x0 = x[:, :D] 2025-05-07T20:33:50.5304705Z x1 = x[:, D:] 2025-05-07T20:33:50.5304779Z 2025-05-07T20:33:50.5304861Z if contiguous: 2025-05-07T20:33:50.5304967Z x0 = x0.contiguous() 2025-05-07T20:33:50.5305058Z x1 = x1.contiguous() 2025-05-07T20:33:50.5305132Z 2025-05-07T20:33:50.5305234Z if scale_ub is not None: 2025-05-07T20:33:50.5305344Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:50.5305482Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:50.5305567Z ) 2025-05-07T20:33:50.5305647Z else: 2025-05-07T20:33:50.5305747Z scale_ub_tensor = None 2025-05-07T20:33:50.5305822Z 2025-05-07T20:33:50.5305952Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:50.5306048Z op = silu_mul_quant 2025-05-07T20:33:50.5306131Z if compiled: 2025-05-07T20:33:50.5306235Z op = torch.compile(op) 2025-05-07T20:33:50.5306344Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.5306468Z 2025-05-07T20:33:50.5306560Z > y_fp8, y_scale = fn() 2025-05-07T20:33:50.5306567Z 2025-05-07T20:33:50.5306672Z moe/activation_test.py:117: 2025-05-07T20:33:50.5306804Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.5306915Z moe/activation_test.py:115: in fn 2025-05-07T20:33:50.5307022Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.5307408Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:50.5307506Z return fn(*args, **kwargs) 
2025-05-07T20:33:50.5308020Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:50.5308116Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:50.5308494Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:50.5308721Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:50.5309124Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:50.5309220Z kernel = self.compile( 2025-05-07T20:33:50.5309620Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:50.5309842Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:50.5309972Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.5309976Z 2025-05-07T20:33:50.5310184Z self = 2025-05-07T20:33:50.5311002Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:50.5311529Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1cdfd300>} 2025-05-07T20:33:50.5312367Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:50.5312564Z context = 2025-05-07T20:33:50.5312569Z 2025-05-07T20:33:50.5312742Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:50.5313012Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:50.5313124Z module_map=module_map) 2025-05-07T20:33:50.5313289Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:50.5313389Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:50.5313473Z E ^ 2025-05-07T20:33:50.5313854Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:50.5313859Z 2025-05-07T20:33:50.5314296Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:50.5314304Z 2025-05-07T20:33:50.5314425Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.5314650Z self=, 2025-05-07T20:33:50.5314728Z T=128, 2025-05-07T20:33:50.5314817Z D=7168, 2025-05-07T20:33:50.5314901Z scale_ub=1200.0, 2025-05-07T20:33:50.5314989Z contiguous=False, 2025-05-07T20:33:50.5315083Z compiled=True, 2025-05-07T20:33:50.5315160Z ) 2025-05-07T20:33:50.5315383Z self = 2025-05-07T20:33:50.5315561Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:50.5315618Z 2025-05-07T20:33:50.5315705Z @given( 2025-05-07T20:33:50.5315839Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.5315943Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.5316066Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.5316193Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.5316309Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.5316395Z ) 2025-05-07T20:33:50.5316650Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.5316746Z def test_silu_mul_quant( 2025-05-07T20:33:50.5316832Z self, 2025-05-07T20:33:50.5316916Z T: int, 2025-05-07T20:33:50.5316992Z D: int, 2025-05-07T20:33:50.5317092Z scale_ub: Optional[float], 2025-05-07T20:33:50.5317181Z contiguous: bool, 2025-05-07T20:33:50.5317264Z compiled: bool, 2025-05-07T20:33:50.5317354Z ) -> None: 2025-05-07T20:33:50.5317449Z torch.manual_seed(2025) 2025-05-07T20:33:50.5317571Z 2025-05-07T20:33:50.5317749Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.5317824Z 2025-05-07T20:33:50.5317917Z x_sign = torch.sign(x) 2025-05-07T20:33:50.5318044Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:50.5318178Z x = x_sign * x_clamp 2025-05-07T20:33:50.5318268Z x0 = x[:, :D] 2025-05-07T20:33:50.5318353Z x1 = x[:, D:] 2025-05-07T20:33:50.5318430Z 2025-05-07T20:33:50.5318519Z if contiguous: 2025-05-07T20:33:50.5318614Z x0 = x0.contiguous() 2025-05-07T20:33:50.5318705Z x1 = x1.contiguous() 2025-05-07T20:33:50.5318786Z 2025-05-07T20:33:50.5318880Z if scale_ub is not None: 2025-05-07T20:33:50.5318989Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:50.5319129Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:50.5319211Z ) 2025-05-07T20:33:50.5319292Z else: 2025-05-07T20:33:50.5319395Z scale_ub_tensor = None 2025-05-07T20:33:50.5319472Z 2025-05-07T20:33:50.5319605Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:50.5319740Z op = silu_mul_quant 2025-05-07T20:33:50.5319834Z if compiled: 2025-05-07T20:33:50.5319940Z op = torch.compile(op) 2025-05-07T20:33:50.5320048Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.5320124Z 2025-05-07T20:33:50.5320223Z > y_fp8, y_scale = fn() 2025-05-07T20:33:50.5320228Z 2025-05-07T20:33:50.5320329Z moe/activation_test.py:117: 2025-05-07T20:33:50.5320464Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.5320574Z moe/activation_test.py:115: in fn 2025-05-07T20:33:50.5320678Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.5321070Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:50.5321171Z return fn(*args, **kwargs) 
2025-05-07T20:33:50.5321690Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:50.5321796Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:50.5322172Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:50.5322400Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:50.5322759Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:50.5322851Z kernel = self.compile( 2025-05-07T20:33:50.5323254Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:50.5323429Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:50.5323610Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.5323614Z 2025-05-07T20:33:50.5323826Z self = 2025-05-07T20:33:50.5324633Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:50.5325152Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1cdfe020>} 2025-05-07T20:33:50.5326186Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:50.5326389Z context = 2025-05-07T20:33:50.5326397Z 2025-05-07T20:33:50.5326654Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:50.5326931Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:50.5327041Z module_map=module_map) 2025-05-07T20:33:50.5327263Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:50.5327361Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:50.5327450Z E ^ 2025-05-07T20:33:50.5327816Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:50.5327821Z 2025-05-07T20:33:50.5328252Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:50.5328257Z 2025-05-07T20:33:50.5328356Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.5328585Z self=, 2025-05-07T20:33:50.5328664Z T=2048, 2025-05-07T20:33:50.5328740Z D=7168, 2025-05-07T20:33:50.5328819Z scale_ub=None, 2025-05-07T20:33:50.5328911Z contiguous=True, 2025-05-07T20:33:50.5329052Z compiled=True, 2025-05-07T20:33:50.5329132Z ) 2025-05-07T20:33:50.5329353Z self = 2025-05-07T20:33:50.5329522Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:50.5329526Z 2025-05-07T20:33:50.5329603Z @given( 2025-05-07T20:33:50.5329717Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.5329813Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.5329925Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.5330038Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.5330147Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.5330225Z ) 2025-05-07T20:33:50.5330478Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.5330571Z def test_silu_mul_quant( 2025-05-07T20:33:50.5330647Z self, 2025-05-07T20:33:50.5330728Z T: int, 2025-05-07T20:33:50.5330810Z D: int, 2025-05-07T20:33:50.5330909Z scale_ub: Optional[float], 2025-05-07T20:33:50.5330998Z contiguous: bool, 2025-05-07T20:33:50.5331082Z compiled: bool, 2025-05-07T20:33:50.5331161Z ) -> None: 2025-05-07T20:33:50.5331254Z torch.manual_seed(2025) 2025-05-07T20:33:50.5331330Z 2025-05-07T20:33:50.5331496Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.5331569Z 2025-05-07T20:33:50.5331663Z x_sign = torch.sign(x) 2025-05-07T20:33:50.5331783Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:50.5331867Z x = x_sign * x_clamp 2025-05-07T20:33:50.5332035Z x0 = x[:, :D] 2025-05-07T20:33:50.5332113Z x1 = x[:, D:] 2025-05-07T20:33:50.5332192Z 2025-05-07T20:33:50.5332277Z if contiguous: 2025-05-07T20:33:50.5332372Z x0 = x0.contiguous() 2025-05-07T20:33:50.5332463Z x1 = x1.contiguous() 2025-05-07T20:33:50.5332541Z 2025-05-07T20:33:50.5332633Z if scale_ub is not None: 2025-05-07T20:33:50.5332745Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:50.5332879Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:50.5332956Z ) 2025-05-07T20:33:50.5333035Z else: 2025-05-07T20:33:50.5333129Z scale_ub_tensor = None 2025-05-07T20:33:50.5333201Z 2025-05-07T20:33:50.5333331Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:50.5333416Z op = silu_mul_quant 2025-05-07T20:33:50.5333504Z if compiled: 2025-05-07T20:33:50.5333599Z op = torch.compile(op) 2025-05-07T20:33:50.5333701Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.5333777Z 2025-05-07T20:33:50.5333911Z > y_fp8, y_scale = fn() 2025-05-07T20:33:50.5333916Z 2025-05-07T20:33:50.5334008Z moe/activation_test.py:117: 2025-05-07T20:33:50.5334146Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.5334283Z moe/activation_test.py:115: in fn 2025-05-07T20:33:50.5334387Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.5334824Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:50.5334918Z return fn(*args, **kwargs) 
2025-05-07T20:33:50.5335438Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:50.5335533Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:50.5335902Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:50.5336136Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:50.5336490Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:50.5336626Z kernel = self.compile( 2025-05-07T20:33:50.5337030Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:50.5337205Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:50.5337343Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.5337347Z 2025-05-07T20:33:50.5337554Z self = 2025-05-07T20:33:50.5338359Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:50.5338881Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1cdff240>} 2025-05-07T20:33:50.5339670Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:50.5339867Z context = 2025-05-07T20:33:50.5339872Z 2025-05-07T20:33:50.5340040Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:50.5340314Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:50.5340419Z module_map=module_map) 2025-05-07T20:33:50.5340583Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:50.5340733Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:50.5340813Z E ^ 2025-05-07T20:33:50.5341180Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:50.5341187Z 2025-05-07T20:33:50.5341623Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:50.5341630Z 2025-05-07T20:33:50.5341728Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.5341955Z self=, 2025-05-07T20:33:50.5342032Z T=16384, 2025-05-07T20:33:50.5342108Z D=5120, 2025-05-07T20:33:50.5342189Z scale_ub=None, 2025-05-07T20:33:50.5342272Z contiguous=False, 2025-05-07T20:33:50.5342352Z compiled=False, 2025-05-07T20:33:50.5342429Z ) 2025-05-07T20:33:50.5342649Z self = 2025-05-07T20:33:50.5342834Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:50.5342881Z 2025-05-07T20:33:50.5342962Z @given( 2025-05-07T20:33:50.5343082Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.5343190Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.5343307Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.5343465Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.5343584Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.5343656Z ) 2025-05-07T20:33:50.5343906Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.5344003Z def test_silu_mul_quant( 2025-05-07T20:33:50.5344081Z self, 2025-05-07T20:33:50.5344160Z T: int, 2025-05-07T20:33:50.5344237Z D: int, 2025-05-07T20:33:50.5344337Z scale_ub: Optional[float], 2025-05-07T20:33:50.5344430Z contiguous: bool, 2025-05-07T20:33:50.5344518Z compiled: bool, 2025-05-07T20:33:50.5344591Z ) -> None: 2025-05-07T20:33:50.5344690Z torch.manual_seed(2025) 2025-05-07T20:33:50.5344763Z 2025-05-07T20:33:50.5344932Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.5345047Z 2025-05-07T20:33:50.5345137Z x_sign = torch.sign(x) 2025-05-07T20:33:50.5345263Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:50.5347182Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
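The CompilationError traces above all fail in the same place: Triton refuses to lower the fp8e4nv (FP8 E4M3) dtype that _fbgemm_silu_mul_quant quantizes into. This job runs on linux.g5.4xlarge.nvidia.gpu (NVIDIA A10G, compute capability 8.6), and Triton only exposes fp8e4nv on newer architectures (roughly sm_89/Ada and later, depending on the Triton version), which is why the error lists only ('fp8e4b15', 'fp8e5') as supported. Below is a minimal sketch of a capability guard a test like this could use to skip FP8 E4M3 cases on older GPUs; the supports_fp8e4nv helper and the (8, 9) threshold are illustrative assumptions, not part of the FBGEMM test harness:

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv (E4M3) lowering in Triton is only available on
        # sufficiently new GPUs; the A10G is sm_86, so this returns
        # False on this runner.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv not supported on this GPU")
    class ActivationFp8Tests(unittest.TestCase):
        ...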
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:50.5347191Z 2025-05-07T20:33:50.5347310Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:50.5347317Z 2025-05-07T20:33:50.5347418Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.5347647Z self=, 2025-05-07T20:33:50.5347726Z T=4096, 2025-05-07T20:33:50.5347806Z D=7168, 2025-05-07T20:33:50.5347889Z scale_ub=1200.0, 2025-05-07T20:33:50.5347977Z contiguous=True, 2025-05-07T20:33:50.5348058Z compiled=True, 2025-05-07T20:33:50.5348127Z ) 2025-05-07T20:33:50.5348352Z self = 2025-05-07T20:33:50.5348524Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:50.5348529Z 2025-05-07T20:33:50.5348606Z @given( 2025-05-07T20:33:50.5348725Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.5348822Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.5348982Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.5349094Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.5349203Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.5349279Z ) 2025-05-07T20:33:50.5349527Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.5349620Z def test_silu_mul_quant( 2025-05-07T20:33:50.5349695Z self, 2025-05-07T20:33:50.5349770Z T: int, 2025-05-07T20:33:50.5349846Z D: int, 2025-05-07T20:33:50.5349948Z scale_ub: Optional[float], 2025-05-07T20:33:50.5350033Z contiguous: bool, 2025-05-07T20:33:50.5350117Z compiled: bool, 2025-05-07T20:33:50.5350193Z ) -> None: 2025-05-07T20:33:50.5350283Z torch.manual_seed(2025) 2025-05-07T20:33:50.5350357Z 2025-05-07T20:33:50.5350522Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.5350595Z 2025-05-07T20:33:50.5350687Z x_sign = torch.sign(x) 2025-05-07T20:33:50.5350850Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:50.5352763Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
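The "Tried to allocate" sizes in these OutOfMemoryError reports match the bfloat16 input the test builds: x has shape [T, 2 * D] at 2 bytes per element, so T=4096, D=7168 needs 4096 * 14336 * 2 bytes = 112 MiB and T=16384, D=7168 needs 448 MiB, exactly the amounts reported. A quick check of that arithmetic:

    # Size in MiB of x = torch.randn([T, 2 * D], dtype=torch.bfloat16).
    def x_mib(T: int, D: int) -> float:
        return T * (2 * D) * 2 / 2**20  # 2 bytes per bfloat16 element

    print(x_mib(4096, 7168))   # 112.0
    print(x_mib(16384, 7168))  # 448.0
    print(x_mib(16384, 5120))  # 320.0

Note that x_sign and x_clamp each allocate another tensor of the same size, so a single example at T=16384, D=7168 needs several such 448 MiB buffers before the kernel even launches.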
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:50.5352814Z 2025-05-07T20:33:50.5352930Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:50.5352935Z 2025-05-07T20:33:50.5353033Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.5353261Z self=, 2025-05-07T20:33:50.5353342Z T=16384, 2025-05-07T20:33:50.5353425Z D=7168, 2025-05-07T20:33:50.5353507Z scale_ub=None, 2025-05-07T20:33:50.5353591Z contiguous=False, 2025-05-07T20:33:50.5353715Z compiled=False, 2025-05-07T20:33:50.5353793Z ) 2025-05-07T20:33:50.5354011Z self = 2025-05-07T20:33:50.5354194Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:50.5354198Z 2025-05-07T20:33:50.5355659Z @given( 2025-05-07T20:33:50.5355777Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.5355877Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.5360560Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.5360689Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.5360805Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.5360883Z ) 2025-05-07T20:33:50.5361138Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.5361234Z def test_silu_mul_quant( 2025-05-07T20:33:50.5361308Z self, 2025-05-07T20:33:50.5361383Z T: int, 2025-05-07T20:33:50.5361461Z D: int, 2025-05-07T20:33:50.5361556Z scale_ub: Optional[float], 2025-05-07T20:33:50.5361646Z contiguous: bool, 2025-05-07T20:33:50.5361730Z compiled: bool, 2025-05-07T20:33:50.5361806Z ) -> None: 2025-05-07T20:33:50.5361899Z torch.manual_seed(2025) 2025-05-07T20:33:50.5361970Z 2025-05-07T20:33:50.5362140Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.5364073Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:50.5364146Z 2025-05-07T20:33:50.5364265Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:50.5364269Z 2025-05-07T20:33:50.5364371Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.5364596Z self=, 2025-05-07T20:33:50.5364673Z T=2048, 2025-05-07T20:33:50.5364751Z D=7168, 2025-05-07T20:33:50.5364831Z scale_ub=1200.0, 2025-05-07T20:33:50.5364910Z contiguous=True, 2025-05-07T20:33:50.5364993Z compiled=True, 2025-05-07T20:33:50.5365067Z ) 2025-05-07T20:33:50.5365291Z self = 2025-05-07T20:33:50.5365468Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:50.5365474Z 2025-05-07T20:33:50.5365594Z @given( 2025-05-07T20:33:50.5365712Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.5365810Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.5365923Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.5366087Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.5366197Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.5366267Z ) 2025-05-07T20:33:50.5366524Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.5366616Z def test_silu_mul_quant( 2025-05-07T20:33:50.5366691Z self, 2025-05-07T20:33:50.5366770Z T: int, 2025-05-07T20:33:50.5366848Z D: int, 2025-05-07T20:33:50.5366943Z scale_ub: Optional[float], 2025-05-07T20:33:50.5367032Z contiguous: bool, 2025-05-07T20:33:50.5367120Z compiled: bool, 2025-05-07T20:33:50.5367203Z ) -> None: 2025-05-07T20:33:50.5367295Z torch.manual_seed(2025) 2025-05-07T20:33:50.5367366Z 2025-05-07T20:33:50.5367537Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.5367650Z 2025-05-07T20:33:50.5367743Z x_sign = torch.sign(x) 2025-05-07T20:33:50.5367871Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:50.5369776Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
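Each "Trying example: test_silu_mul_quant(...)" block is Hypothesis re-running the same test body with newly drawn parameters; the @settings(verbosity=Verbosity.verbose, ...) decorator is what prints every attempt. Because all examples run inside a single test invocation, blocks cached by the CUDA allocator during earlier failed examples are still reserved when the next example starts, which is consistent with the OOMs appearing only after many attempts. One possible mitigation, shown here as a sketch rather than anything activation_test.py currently does, is to release cached blocks at the top of the test body:

    import gc

    import torch

    def reset_cuda_between_examples() -> None:
        # Drop dead Python references first, then return cached but
        # unused blocks from PyTorch's caching allocator to the driver.
        gc.collect()
        torch.cuda.empty_cache()

    # Intended to be called as the first statement of test_silu_mul_quant,
    # so each Hypothesis example starts from a drained cache.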
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:50.5369785Z 2025-05-07T20:33:50.5369913Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:50.5369917Z 2025-05-07T20:33:50.5370017Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.5370253Z self=, 2025-05-07T20:33:50.5370328Z T=2048, 2025-05-07T20:33:50.5370407Z D=7168, 2025-05-07T20:33:50.5370496Z scale_ub=None, 2025-05-07T20:33:50.5370581Z contiguous=True, 2025-05-07T20:33:50.5370664Z compiled=False, 2025-05-07T20:33:50.5370740Z ) 2025-05-07T20:33:50.5370961Z self = 2025-05-07T20:33:50.5371137Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:50.5371142Z 2025-05-07T20:33:50.5371225Z @given( 2025-05-07T20:33:50.5371339Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.5371435Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.5371595Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.5371716Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.5371829Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.5371898Z ) 2025-05-07T20:33:50.5372148Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.5372245Z def test_silu_mul_quant( 2025-05-07T20:33:50.5372318Z self, 2025-05-07T20:33:50.5372391Z T: int, 2025-05-07T20:33:50.5372469Z D: int, 2025-05-07T20:33:50.5372564Z scale_ub: Optional[float], 2025-05-07T20:33:50.5372651Z contiguous: bool, 2025-05-07T20:33:50.5372737Z compiled: bool, 2025-05-07T20:33:50.5372810Z ) -> None: 2025-05-07T20:33:50.5372903Z torch.manual_seed(2025) 2025-05-07T20:33:50.5372975Z 2025-05-07T20:33:50.5373142Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.5373224Z 2025-05-07T20:33:50.5373314Z > x_sign = torch.sign(x) 2025-05-07T20:33:50.5375345Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:50.5375394Z 2025-05-07T20:33:50.5375512Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:50.5375516Z 2025-05-07T20:33:50.5375616Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.5375843Z self=, 2025-05-07T20:33:50.5375916Z T=1, 2025-05-07T20:33:50.5375997Z D=7168, 2025-05-07T20:33:50.5376084Z scale_ub=1200.0, 2025-05-07T20:33:50.5376169Z contiguous=True, 2025-05-07T20:33:50.5376251Z compiled=False, 2025-05-07T20:33:50.5376332Z ) 2025-05-07T20:33:50.5376589Z self = 2025-05-07T20:33:50.5376763Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:50.5376771Z 2025-05-07T20:33:50.5376849Z @given( 2025-05-07T20:33:50.5376963Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.5377061Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.5377177Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.5377289Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.5377401Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.5377474Z ) 2025-05-07T20:33:50.5377722Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.5377823Z def test_silu_mul_quant( 2025-05-07T20:33:50.5377898Z self, 2025-05-07T20:33:50.5377977Z T: int, 2025-05-07T20:33:50.5378053Z D: int, 2025-05-07T20:33:50.5378148Z scale_ub: Optional[float], 2025-05-07T20:33:50.5378241Z contiguous: bool, 2025-05-07T20:33:50.5378322Z compiled: bool, 2025-05-07T20:33:50.5378401Z ) -> None: 2025-05-07T20:33:50.5378496Z torch.manual_seed(2025) 2025-05-07T20:33:50.5378567Z 2025-05-07T20:33:50.5378734Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.5378811Z 2025-05-07T20:33:50.5378903Z x_sign = torch.sign(x) 2025-05-07T20:33:50.5379024Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:50.5379110Z x = x_sign * x_clamp 2025-05-07T20:33:50.5379185Z x0 = x[:, :D] 2025-05-07T20:33:50.5379267Z x1 = x[:, D:] 2025-05-07T20:33:50.5379339Z 2025-05-07T20:33:50.5379420Z if contiguous: 2025-05-07T20:33:50.5379555Z x0 = x0.contiguous() 2025-05-07T20:33:50.5379645Z x1 = x1.contiguous() 2025-05-07T20:33:50.5379713Z 2025-05-07T20:33:50.5379804Z if scale_ub is not None: 2025-05-07T20:33:50.5379910Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:50.5380046Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:50.5380122Z ) 2025-05-07T20:33:50.5380198Z else: 2025-05-07T20:33:50.5380287Z scale_ub_tensor = None 2025-05-07T20:33:50.5380362Z 2025-05-07T20:33:50.5380490Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:50.5380582Z op = silu_mul_quant 2025-05-07T20:33:50.5380661Z if compiled: 2025-05-07T20:33:50.5380759Z op = torch.compile(op) 2025-05-07T20:33:50.5380863Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.5380936Z 2025-05-07T20:33:50.5381023Z > y_fp8, y_scale = fn() 2025-05-07T20:33:50.5381030Z 2025-05-07T20:33:50.5381128Z moe/activation_test.py:117: 2025-05-07T20:33:50.5381299Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.5381402Z moe/activation_test.py:115: in fn 2025-05-07T20:33:50.5381508Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.5382032Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:50.5382174Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:50.5382550Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:50.5382781Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:50.5383136Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:50.5383226Z kernel = self.compile( 2025-05-07T20:33:50.5383633Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:50.5383809Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:50.5383978Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.5383983Z 2025-05-07T20:33:50.5384200Z self = 2025-05-07T20:33:50.5385012Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:50.5385530Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1dd96520>} 2025-05-07T20:33:50.5386323Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:50.5386521Z context = 2025-05-07T20:33:50.5386526Z 2025-05-07T20:33:50.5386702Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:50.5386973Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:50.5387081Z module_map=module_map) 2025-05-07T20:33:50.5387241Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:50.5387337Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:50.5387418Z E ^ 2025-05-07T20:33:50.5387784Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:50.5387789Z 2025-05-07T20:33:50.5388219Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:50.5388294Z 2025-05-07T20:33:50.5388394Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.5388619Z self=, 2025-05-07T20:33:50.5388702Z T=128, 2025-05-07T20:33:50.5388779Z D=5120, 2025-05-07T20:33:50.5388861Z scale_ub=None, 2025-05-07T20:33:50.5388947Z contiguous=True, 2025-05-07T20:33:50.5389027Z compiled=False, 2025-05-07T20:33:50.5389102Z ) 2025-05-07T20:33:50.5389328Z self = 2025-05-07T20:33:50.5389497Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:50.5389502Z 2025-05-07T20:33:50.5389581Z @given( 2025-05-07T20:33:50.5389702Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.5389798Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.5389912Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.5390072Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.5390185Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.5390257Z ) 2025-05-07T20:33:50.5390509Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.5390668Z def test_silu_mul_quant( 2025-05-07T20:33:50.5390749Z self, 2025-05-07T20:33:50.5390825Z T: int, 2025-05-07T20:33:50.5390902Z D: int, 2025-05-07T20:33:50.5391000Z scale_ub: Optional[float], 2025-05-07T20:33:50.5391084Z contiguous: bool, 2025-05-07T20:33:50.5391168Z compiled: bool, 2025-05-07T20:33:50.5391243Z ) -> None: 2025-05-07T20:33:50.5391333Z torch.manual_seed(2025) 2025-05-07T20:33:50.5391409Z 2025-05-07T20:33:50.5391575Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.5391649Z 2025-05-07T20:33:50.5391739Z x_sign = torch.sign(x) 2025-05-07T20:33:50.5391863Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:50.5391952Z x = x_sign * x_clamp 2025-05-07T20:33:50.5392034Z x0 = x[:, :D] 2025-05-07T20:33:50.5392111Z x1 = x[:, D:] 2025-05-07T20:33:50.5392223Z 2025-05-07T20:33:50.5392310Z if contiguous: 2025-05-07T20:33:50.5392405Z x0 = x0.contiguous() 2025-05-07T20:33:50.5392494Z x1 = x1.contiguous() 2025-05-07T20:33:50.5392571Z 2025-05-07T20:33:50.5392661Z if scale_ub is not None: 2025-05-07T20:33:50.5392774Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:50.5392907Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:50.5392984Z ) 2025-05-07T20:33:50.5393065Z else: 2025-05-07T20:33:50.5393162Z scale_ub_tensor = None 2025-05-07T20:33:50.5393239Z 2025-05-07T20:33:50.5393370Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:50.5393460Z op = silu_mul_quant 2025-05-07T20:33:50.5393542Z if compiled: 2025-05-07T20:33:50.5393647Z op = torch.compile(op) 2025-05-07T20:33:50.5393749Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.5393815Z 2025-05-07T20:33:50.5393906Z > y_fp8, y_scale = fn() 2025-05-07T20:33:50.5393911Z 2025-05-07T20:33:50.5394009Z moe/activation_test.py:117: 2025-05-07T20:33:50.5394144Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.5394242Z moe/activation_test.py:115: in fn 2025-05-07T20:33:50.5394345Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.5394867Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:50.5394964Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:50.5395335Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:50.5395609Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:50.5395961Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:50.5396056Z kernel = self.compile( 2025-05-07T20:33:50.5396455Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:50.5396629Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:50.5396755Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.5396760Z 2025-05-07T20:33:50.5396964Z self = 2025-05-07T20:33:50.5397773Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:50.5398330Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1dd97420>} 2025-05-07T20:33:50.5399123Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:50.5399356Z context = 2025-05-07T20:33:50.5399361Z 2025-05-07T20:33:50.5399529Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:50.5399803Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:50.5399908Z module_map=module_map) 2025-05-07T20:33:50.5400068Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:50.5400167Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:50.5400242Z E ^ 2025-05-07T20:33:50.5400613Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:50.5400617Z 2025-05-07T20:33:50.5401089Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:50.5401099Z 2025-05-07T20:33:50.5401200Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.5401432Z self=, 2025-05-07T20:33:50.5401509Z T=128, 2025-05-07T20:33:50.5401583Z D=7168, 2025-05-07T20:33:50.5401666Z scale_ub=None, 2025-05-07T20:33:50.5401750Z contiguous=True, 2025-05-07T20:33:50.5401832Z compiled=False, 2025-05-07T20:33:50.5401906Z ) 2025-05-07T20:33:50.5402125Z self = 2025-05-07T20:33:50.5402303Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:50.5402307Z 2025-05-07T20:33:50.5402388Z @given( 2025-05-07T20:33:50.5402504Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.5402606Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.5402719Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.5402837Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.5402949Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.5403018Z ) 2025-05-07T20:33:50.5403269Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.5403360Z def test_silu_mul_quant( 2025-05-07T20:33:50.5403429Z self, 2025-05-07T20:33:50.5403512Z T: int, 2025-05-07T20:33:50.5403584Z D: int, 2025-05-07T20:33:50.5403680Z scale_ub: Optional[float], 2025-05-07T20:33:50.5403769Z contiguous: bool, 2025-05-07T20:33:50.5403902Z compiled: bool, 2025-05-07T20:33:50.5403980Z ) -> None: 2025-05-07T20:33:50.5404076Z torch.manual_seed(2025) 2025-05-07T20:33:50.5404145Z 2025-05-07T20:33:50.5404315Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.5404398Z 2025-05-07T20:33:50.5404487Z x_sign = torch.sign(x) 2025-05-07T20:33:50.5404613Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:50.5404700Z x = x_sign * x_clamp 2025-05-07T20:33:50.5404779Z x0 = x[:, :D] 2025-05-07T20:33:50.5404861Z x1 = x[:, D:] 2025-05-07T20:33:50.5404937Z 2025-05-07T20:33:50.5405021Z if contiguous: 2025-05-07T20:33:50.5405117Z x0 = x0.contiguous() 2025-05-07T20:33:50.5405206Z x1 = x1.contiguous() 2025-05-07T20:33:50.5405281Z 2025-05-07T20:33:50.5405370Z if scale_ub is not None: 2025-05-07T20:33:50.5405474Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:50.5405603Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:50.5405679Z ) 2025-05-07T20:33:50.5405797Z else: 2025-05-07T20:33:50.5405899Z scale_ub_tensor = None 2025-05-07T20:33:50.5405969Z 2025-05-07T20:33:50.5406097Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:50.5406194Z op = silu_mul_quant 2025-05-07T20:33:50.5406319Z if compiled: 2025-05-07T20:33:50.5406418Z op = torch.compile(op) 2025-05-07T20:33:50.5406526Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.5406600Z 2025-05-07T20:33:50.5406690Z > y_fp8, y_scale = fn() 2025-05-07T20:33:50.5406694Z 2025-05-07T20:33:50.5406793Z moe/activation_test.py:117: 2025-05-07T20:33:50.5406924Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.5407029Z moe/activation_test.py:115: in fn 2025-05-07T20:33:50.5407132Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.5407656Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:50.5407752Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:50.5408168Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:50.5408395Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:50.5408754Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:50.5408849Z kernel = self.compile( 2025-05-07T20:33:50.5409253Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:50.5409430Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:50.5409554Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.5409561Z 2025-05-07T20:33:50.5409774Z self = 2025-05-07T20:33:50.5410633Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:50.5411147Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1cc204a0>} 2025-05-07T20:33:50.5411938Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:50.5412132Z context = 2025-05-07T20:33:50.5412137Z 2025-05-07T20:33:50.5412313Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:50.5412812Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:50.5412925Z module_map=module_map) 2025-05-07T20:33:50.5413089Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:50.5413191Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:50.5413276Z E ^ 2025-05-07T20:33:50.5413646Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:50.5413650Z 2025-05-07T20:33:50.5414080Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:50.5414087Z 2025-05-07T20:33:50.5414186Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.5414412Z self=, 2025-05-07T20:33:50.5414536Z T=2048, 2025-05-07T20:33:50.5414614Z D=7168, 2025-05-07T20:33:50.5414696Z scale_ub=1200.0, 2025-05-07T20:33:50.5414835Z contiguous=True, 2025-05-07T20:33:50.5414920Z compiled=False, 2025-05-07T20:33:50.5414993Z ) 2025-05-07T20:33:50.5415226Z self = 2025-05-07T20:33:50.5415405Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:50.5415451Z 2025-05-07T20:33:50.5415530Z @given( 2025-05-07T20:33:50.5415647Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.5415746Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.5415863Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.5415981Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.5416096Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.5416176Z ) 2025-05-07T20:33:50.5416427Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.5416524Z def test_silu_mul_quant( 2025-05-07T20:33:50.5416608Z self, 2025-05-07T20:33:50.5416686Z T: int, 2025-05-07T20:33:50.5416764Z D: int, 2025-05-07T20:33:50.5416865Z scale_ub: Optional[float], 2025-05-07T20:33:50.5416996Z contiguous: bool, 2025-05-07T20:33:50.5417088Z compiled: bool, 2025-05-07T20:33:50.5417168Z ) -> None: 2025-05-07T20:33:50.5417264Z torch.manual_seed(2025) 2025-05-07T20:33:50.5417340Z 2025-05-07T20:33:50.5417509Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.5419422Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:50.5419435Z 2025-05-07T20:33:50.5419550Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:50.5419555Z 2025-05-07T20:33:50.5419683Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.5419941Z self=, 2025-05-07T20:33:50.5420016Z T=1, 2025-05-07T20:33:50.5420089Z D=5120, 2025-05-07T20:33:50.5420174Z scale_ub=1200.0, 2025-05-07T20:33:50.5420255Z contiguous=True, 2025-05-07T20:33:50.5420335Z compiled=False, 2025-05-07T20:33:50.5420417Z ) 2025-05-07T20:33:50.5420643Z self = 2025-05-07T20:33:50.5420804Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:50.5420809Z 2025-05-07T20:33:50.5420933Z @given( 2025-05-07T20:33:50.5421054Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.5421154Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.5421267Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.5421392Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.5421504Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.5421584Z ) 2025-05-07T20:33:50.5421835Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.5421930Z def test_silu_mul_quant( 2025-05-07T20:33:50.5422010Z self, 2025-05-07T20:33:50.5422086Z T: int, 2025-05-07T20:33:50.5422164Z D: int, 2025-05-07T20:33:50.5422264Z scale_ub: Optional[float], 2025-05-07T20:33:50.5422352Z contiguous: bool, 2025-05-07T20:33:50.5422436Z compiled: bool, 2025-05-07T20:33:50.5422514Z ) -> None: 2025-05-07T20:33:50.5422605Z torch.manual_seed(2025) 2025-05-07T20:33:50.5422679Z 2025-05-07T20:33:50.5422916Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.5422989Z 2025-05-07T20:33:50.5423081Z x_sign = torch.sign(x) 2025-05-07T20:33:50.5423207Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:50.5423296Z x = x_sign * x_clamp 2025-05-07T20:33:50.5423420Z x0 = x[:, :D] 2025-05-07T20:33:50.5423503Z x1 = x[:, D:] 2025-05-07T20:33:50.5423575Z 2025-05-07T20:33:50.5423662Z if contiguous: 2025-05-07T20:33:50.5423755Z x0 = x0.contiguous() 2025-05-07T20:33:50.5423845Z x1 = x1.contiguous() 2025-05-07T20:33:50.5423924Z 2025-05-07T20:33:50.5424014Z if scale_ub is not None: 2025-05-07T20:33:50.5424118Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:50.5424256Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:50.5424331Z ) 2025-05-07T20:33:50.5424412Z else: 2025-05-07T20:33:50.5424512Z scale_ub_tensor = None 2025-05-07T20:33:50.5424589Z 2025-05-07T20:33:50.5424722Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:50.5424811Z op = silu_mul_quant 2025-05-07T20:33:50.5424933Z if compiled: 2025-05-07T20:33:50.5425035Z op = torch.compile(op) 2025-05-07T20:33:50.5425139Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.5425210Z 2025-05-07T20:33:50.5425302Z > y_fp8, y_scale = fn() 2025-05-07T20:33:50.5425306Z 2025-05-07T20:33:50.5425727Z moe/activation_test.py:117: 2025-05-07T20:33:50.5425914Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.5426055Z moe/activation_test.py:115: in fn 2025-05-07T20:33:50.5426190Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:50.5426741Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:50.5426843Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:50.5427217Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:50.5427447Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:50.5427805Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:50.5427899Z kernel = self.compile( 2025-05-07T20:33:50.5428295Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:50.5428471Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:50.5428603Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:50.5428607Z 2025-05-07T20:33:50.5428815Z self = 2025-05-07T20:33:50.5429732Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:50.5430248Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1cc21a80>} 2025-05-07T20:33:50.5431043Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:50.5431238Z context = 2025-05-07T20:33:50.5431243Z 2025-05-07T20:33:50.5431411Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:50.5431684Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:50.5431854Z module_map=module_map) 2025-05-07T20:33:50.5432017Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:50.5432120Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:50.5432198Z E ^ 2025-05-07T20:33:50.5432565Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:50.5432633Z 2025-05-07T20:33:50.5433071Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:50.5433076Z 2025-05-07T20:33:50.5433176Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.5433409Z self=, 2025-05-07T20:33:50.5433486Z T=2048, 2025-05-07T20:33:50.5433562Z D=5120, 2025-05-07T20:33:50.5433643Z scale_ub=None, 2025-05-07T20:33:50.5433724Z contiguous=True, 2025-05-07T20:33:50.5433808Z compiled=False, 2025-05-07T20:33:50.5433889Z ) 2025-05-07T20:33:50.5434115Z self = 2025-05-07T20:33:50.5434292Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:50.5434356Z 2025-05-07T20:33:50.5434436Z @given( 2025-05-07T20:33:50.5434557Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.5434661Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.5434776Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.5434893Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.5435013Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.5435089Z ) 2025-05-07T20:33:50.5435340Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.5435437Z def test_silu_mul_quant( 2025-05-07T20:33:50.5435521Z self, 2025-05-07T20:33:50.5435604Z T: int, 2025-05-07T20:33:50.5435683Z D: int, 2025-05-07T20:33:50.5435788Z scale_ub: Optional[float], 2025-05-07T20:33:50.5435880Z contiguous: bool, 2025-05-07T20:33:50.5435966Z compiled: bool, 2025-05-07T20:33:50.5436044Z ) -> None: 2025-05-07T20:33:50.5436142Z torch.manual_seed(2025) 2025-05-07T20:33:50.5436212Z 2025-05-07T20:33:50.5436381Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.5436463Z 2025-05-07T20:33:50.5436553Z > x_sign = torch.sign(x) 2025-05-07T20:33:50.5438471Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:50.5438522Z 2025-05-07T20:33:50.5438637Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:50.5438642Z 2025-05-07T20:33:50.5438745Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.5438978Z self=, 2025-05-07T20:33:50.5439055Z T=16384, 2025-05-07T20:33:50.5439135Z D=5120, 2025-05-07T20:33:50.5439214Z scale_ub=None, 2025-05-07T20:33:50.5439299Z contiguous=True, 2025-05-07T20:33:50.5439386Z compiled=False, 2025-05-07T20:33:50.5439463Z ) 2025-05-07T20:33:50.5439680Z self = 2025-05-07T20:33:50.5439859Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:50.5439864Z 2025-05-07T20:33:50.5439938Z @given( 2025-05-07T20:33:50.5440055Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.5440196Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.5440309Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.5440428Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.5440539Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.5440653Z ) 2025-05-07T20:33:50.5440911Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.5441003Z def test_silu_mul_quant( 2025-05-07T20:33:50.5441077Z self, 2025-05-07T20:33:50.5441155Z T: int, 2025-05-07T20:33:50.5441227Z D: int, 2025-05-07T20:33:50.5441321Z scale_ub: Optional[float], 2025-05-07T20:33:50.5441409Z contiguous: bool, 2025-05-07T20:33:50.5441490Z compiled: bool, 2025-05-07T20:33:50.5441569Z ) -> None: 2025-05-07T20:33:50.5441662Z torch.manual_seed(2025) 2025-05-07T20:33:50.5441735Z 2025-05-07T20:33:50.5441910Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.5443858Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
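The allocator hint repeated in every OOM message, PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, targets exactly the pattern visible here: tens of MiB "reserved by PyTorch but unallocated" while only ~26 MiB is free, i.e. cached segments fragmented into pieces too small for the next 40-448 MiB request. The setting must be in the environment before the process touches CUDA; a sketch of doing that from Python (exporting it in the CI job's environment would work equally well):

    import os

    # Must happen before the first CUDA allocation in the process,
    # so set it before importing or initializing torch.cuda.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    import torch

    x = torch.zeros(1, device="cuda")  # allocator now uses expandable segments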
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:50.5443868Z 2025-05-07T20:33:50.5443992Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:50.5443996Z 2025-05-07T20:33:50.5444097Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.5444322Z self=, 2025-05-07T20:33:50.5444403Z T=4096, 2025-05-07T20:33:50.5444481Z D=5120, 2025-05-07T20:33:50.5444564Z scale_ub=None, 2025-05-07T20:33:50.5444652Z contiguous=True, 2025-05-07T20:33:50.5444735Z compiled=False, 2025-05-07T20:33:50.5444808Z ) 2025-05-07T20:33:50.5445029Z self = 2025-05-07T20:33:50.5445207Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:50.5445211Z 2025-05-07T20:33:50.5445293Z @given( 2025-05-07T20:33:50.5445409Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.5445505Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.5445619Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.5445732Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.5445841Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.5445915Z ) 2025-05-07T20:33:50.5446161Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.5446304Z def test_silu_mul_quant( 2025-05-07T20:33:50.5446380Z self, 2025-05-07T20:33:50.5446455Z T: int, 2025-05-07T20:33:50.5446531Z D: int, 2025-05-07T20:33:50.5446629Z scale_ub: Optional[float], 2025-05-07T20:33:50.5446714Z contiguous: bool, 2025-05-07T20:33:50.5446800Z compiled: bool, 2025-05-07T20:33:50.5446876Z ) -> None: 2025-05-07T20:33:50.5446966Z torch.manual_seed(2025) 2025-05-07T20:33:50.5447039Z 2025-05-07T20:33:50.5447207Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.5449145Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:50.5449154Z 2025-05-07T20:33:50.5449272Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:50.5449315Z 2025-05-07T20:33:50.5449414Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.5449644Z self=, 2025-05-07T20:33:50.5449716Z T=2048, 2025-05-07T20:33:50.5449797Z D=5120, 2025-05-07T20:33:50.5449879Z scale_ub=None, 2025-05-07T20:33:50.5449964Z contiguous=False, 2025-05-07T20:33:50.5450054Z compiled=False, 2025-05-07T20:33:50.5450145Z ) 2025-05-07T20:33:50.5450390Z self = 2025-05-07T20:33:50.5450568Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:50.5450578Z 2025-05-07T20:33:50.5450656Z @given( 2025-05-07T20:33:50.5450775Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.5450875Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.5450985Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.5451144Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.5451259Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.5451332Z ) 2025-05-07T20:33:50.5451584Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.5451675Z def test_silu_mul_quant( 2025-05-07T20:33:50.5451750Z self, 2025-05-07T20:33:50.5451827Z T: int, 2025-05-07T20:33:50.5451902Z D: int, 2025-05-07T20:33:50.5451998Z scale_ub: Optional[float], 2025-05-07T20:33:50.5452088Z contiguous: bool, 2025-05-07T20:33:50.5452170Z compiled: bool, 2025-05-07T20:33:50.5452246Z ) -> None: 2025-05-07T20:33:50.5452343Z torch.manual_seed(2025) 2025-05-07T20:33:50.5452412Z 2025-05-07T20:33:50.5452586Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.5454542Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:50.5454551Z 2025-05-07T20:33:50.5454673Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:50.5454678Z 2025-05-07T20:33:50.5454776Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.5455002Z self=, 2025-05-07T20:33:50.5455155Z T=4096, 2025-05-07T20:33:50.5455229Z D=7168, 2025-05-07T20:33:50.5455312Z scale_ub=None, 2025-05-07T20:33:50.5455399Z contiguous=True, 2025-05-07T20:33:50.5455483Z compiled=True, 2025-05-07T20:33:50.5455556Z ) 2025-05-07T20:33:50.5455779Z self = 2025-05-07T20:33:50.5455950Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:50.5455954Z 2025-05-07T20:33:50.5456033Z @given( 2025-05-07T20:33:50.5456145Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.5456244Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.5456358Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.5456472Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.5456584Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.5456663Z ) 2025-05-07T20:33:50.5456959Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.5457057Z def test_silu_mul_quant( 2025-05-07T20:33:50.5457129Z self, 2025-05-07T20:33:50.5457201Z T: int, 2025-05-07T20:33:50.5457281Z D: int, 2025-05-07T20:33:50.5457377Z scale_ub: Optional[float], 2025-05-07T20:33:50.5457505Z contiguous: bool, 2025-05-07T20:33:50.5457592Z compiled: bool, 2025-05-07T20:33:50.5457667Z ) -> None: 2025-05-07T20:33:50.5457760Z torch.manual_seed(2025) 2025-05-07T20:33:50.5457833Z 2025-05-07T20:33:50.5457998Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.5459991Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:50.5460000Z 2025-05-07T20:33:50.5460117Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:50.5460121Z 2025-05-07T20:33:50.5460220Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.5460446Z self=, 2025-05-07T20:33:50.5460518Z T=2048, 2025-05-07T20:33:50.5460598Z D=5120, 2025-05-07T20:33:50.5460678Z scale_ub=1200.0, 2025-05-07T20:33:50.5460762Z contiguous=False, 2025-05-07T20:33:50.5460849Z compiled=False, 2025-05-07T20:33:50.5460929Z ) 2025-05-07T20:33:50.5461148Z self = 2025-05-07T20:33:50.5461331Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:50.5461338Z 2025-05-07T20:33:50.5461414Z @given( 2025-05-07T20:33:50.5461527Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.5461632Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.5461745Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.5461867Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.5461977Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.5462049Z ) 2025-05-07T20:33:50.5462298Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.5462388Z def test_silu_mul_quant( 2025-05-07T20:33:50.5462461Z self, 2025-05-07T20:33:50.5462536Z T: int, 2025-05-07T20:33:50.5462608Z D: int, 2025-05-07T20:33:50.5462702Z scale_ub: Optional[float], 2025-05-07T20:33:50.5462788Z contiguous: bool, 2025-05-07T20:33:50.5462872Z compiled: bool, 2025-05-07T20:33:50.5463001Z ) -> None: 2025-05-07T20:33:50.5463094Z torch.manual_seed(2025) 2025-05-07T20:33:50.5463162Z 2025-05-07T20:33:50.5463334Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.5465230Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:50.5465238Z 2025-05-07T20:33:50.5465354Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:50.5465358Z 2025-05-07T20:33:50.5465460Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.5465724Z self=, 2025-05-07T20:33:50.5465804Z T=4096, 2025-05-07T20:33:50.5465880Z D=7168, 2025-05-07T20:33:50.5465962Z scale_ub=1200.0, 2025-05-07T20:33:50.5466050Z contiguous=True, 2025-05-07T20:33:50.5466132Z compiled=False, 2025-05-07T20:33:50.5466249Z ) 2025-05-07T20:33:50.5466470Z self = 2025-05-07T20:33:50.5466644Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:50.5466648Z 2025-05-07T20:33:50.5466731Z @given( 2025-05-07T20:33:50.5466844Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.5466941Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.5467057Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.5467171Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.5467281Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.5467358Z ) 2025-05-07T20:33:50.5467611Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.5467705Z def test_silu_mul_quant( 2025-05-07T20:33:50.5467819Z self, 2025-05-07T20:33:50.5467892Z T: int, 2025-05-07T20:33:50.5467968Z D: int, 2025-05-07T20:33:50.5468062Z scale_ub: Optional[float], 2025-05-07T20:33:50.5468148Z contiguous: bool, 2025-05-07T20:33:50.5468233Z compiled: bool, 2025-05-07T20:33:50.5468310Z ) -> None: 2025-05-07T20:33:50.5468399Z torch.manual_seed(2025) 2025-05-07T20:33:50.5468470Z 2025-05-07T20:33:50.5468638Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.5470546Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
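The figures quoted in each message ("21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated") come from the caching allocator's own counters, which can also be read directly when triaging OOMs like these. A short sketch for logging them between Hypothesis examples:

    import torch

    def log_cuda_mem(tag: str) -> None:
        # allocated = memory backing live tensors;
        # reserved = allocated plus cached free blocks held by the allocator.
        alloc = torch.cuda.memory_allocated() / 2**30
        reserved = torch.cuda.memory_reserved() / 2**30
        print(f"{tag}: allocated={alloc:.2f} GiB, reserved={reserved:.2f} GiB")

    log_cuda_mem("before example")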
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:50.5470556Z 2025-05-07T20:33:50.5470670Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:50.5470674Z 2025-05-07T20:33:50.5470779Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.5471003Z self=, 2025-05-07T20:33:50.5471076Z T=16384, 2025-05-07T20:33:50.5471156Z D=7168, 2025-05-07T20:33:50.5471238Z scale_ub=None, 2025-05-07T20:33:50.5471323Z contiguous=False, 2025-05-07T20:33:50.5471406Z compiled=True, 2025-05-07T20:33:50.5471478Z ) 2025-05-07T20:33:50.5471699Z self = 2025-05-07T20:33:50.5471927Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:50.5471932Z 2025-05-07T20:33:50.5472008Z @given( 2025-05-07T20:33:50.5472124Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.5472231Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.5472342Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.5472457Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.5472566Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.5472640Z ) 2025-05-07T20:33:50.5472894Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.5472987Z def test_silu_mul_quant( 2025-05-07T20:33:50.5473059Z self, 2025-05-07T20:33:50.5473135Z T: int, 2025-05-07T20:33:50.5473210Z D: int, 2025-05-07T20:33:50.5473306Z scale_ub: Optional[float], 2025-05-07T20:33:50.5473402Z contiguous: bool, 2025-05-07T20:33:50.5473485Z compiled: bool, 2025-05-07T20:33:50.5473611Z ) -> None: 2025-05-07T20:33:50.5473701Z torch.manual_seed(2025) 2025-05-07T20:33:50.5473772Z 2025-05-07T20:33:50.5473942Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.5475891Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:50.5475897Z 2025-05-07T20:33:50.5476014Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:50.5476021Z 2025-05-07T20:33:50.5476122Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.5476344Z self=, 2025-05-07T20:33:50.5476421Z T=4096, 2025-05-07T20:33:50.5476538Z D=7168, 2025-05-07T20:33:50.5476627Z scale_ub=None, 2025-05-07T20:33:50.5476719Z contiguous=True, 2025-05-07T20:33:50.5476807Z compiled=False, 2025-05-07T20:33:50.5476882Z ) 2025-05-07T20:33:50.5477107Z self = 2025-05-07T20:33:50.5477280Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:50.5477285Z 2025-05-07T20:33:50.5477363Z @given( 2025-05-07T20:33:50.5477480Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.5477580Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.5477697Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.5477816Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.5477933Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.5478009Z ) 2025-05-07T20:33:50.5478258Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.5478360Z def test_silu_mul_quant( 2025-05-07T20:33:50.5478439Z self, 2025-05-07T20:33:50.5478518Z T: int, 2025-05-07T20:33:50.5478599Z D: int, 2025-05-07T20:33:50.5478692Z scale_ub: Optional[float], 2025-05-07T20:33:50.5478778Z contiguous: bool, 2025-05-07T20:33:50.5478863Z compiled: bool, 2025-05-07T20:33:50.5478936Z ) -> None: 2025-05-07T20:33:50.5479025Z torch.manual_seed(2025) 2025-05-07T20:33:50.5479095Z 2025-05-07T20:33:50.5479259Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.5481171Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:50.5481226Z 2025-05-07T20:33:50.5481341Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:50.5481345Z 2025-05-07T20:33:50.5481452Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.5481683Z self=, 2025-05-07T20:33:50.5481758Z T=16384, 2025-05-07T20:33:50.5481835Z D=7168, 2025-05-07T20:33:50.5481919Z scale_ub=None, 2025-05-07T20:33:50.5482001Z contiguous=True, 2025-05-07T20:33:50.5482086Z compiled=False, 2025-05-07T20:33:50.5482163Z ) 2025-05-07T20:33:50.5482424Z self = 2025-05-07T20:33:50.5482603Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:50.5482608Z 2025-05-07T20:33:50.5482682Z @given( 2025-05-07T20:33:50.5482797Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.5482938Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.5483048Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.5483166Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.5483275Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.5483349Z ) 2025-05-07T20:33:50.5483599Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.5483695Z def test_silu_mul_quant( 2025-05-07T20:33:50.5483772Z self, 2025-05-07T20:33:50.5483848Z T: int, 2025-05-07T20:33:50.5483922Z D: int, 2025-05-07T20:33:50.5484019Z scale_ub: Optional[float], 2025-05-07T20:33:50.5484113Z contiguous: bool, 2025-05-07T20:33:50.5484194Z compiled: bool, 2025-05-07T20:33:50.5484277Z ) -> None: 2025-05-07T20:33:50.5484367Z torch.manual_seed(2025) 2025-05-07T20:33:50.5489044Z 2025-05-07T20:33:50.5489240Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.5491168Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:50.5491177Z 2025-05-07T20:33:50.5491297Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:50.5491305Z 2025-05-07T20:33:50.5491408Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:50.5491640Z self=, 2025-05-07T20:33:50.5491722Z T=16384, 2025-05-07T20:33:50.5491805Z D=7168, 2025-05-07T20:33:50.5491889Z scale_ub=1200.0, 2025-05-07T20:33:50.5491974Z contiguous=True, 2025-05-07T20:33:50.5492062Z compiled=False, 2025-05-07T20:33:50.5492139Z ) 2025-05-07T20:33:50.5492359Z self = 2025-05-07T20:33:50.5492542Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:50.5492546Z 2025-05-07T20:33:50.5492622Z @given( 2025-05-07T20:33:50.5492740Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:50.5492837Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:50.5492997Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:50.5493116Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:50.5493227Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:50.5493300Z ) 2025-05-07T20:33:50.5493555Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:50.5493649Z def test_silu_mul_quant( 2025-05-07T20:33:50.5493728Z self, 2025-05-07T20:33:50.5493806Z T: int, 2025-05-07T20:33:50.5493884Z D: int, 2025-05-07T20:33:50.5493983Z scale_ub: Optional[float], 2025-05-07T20:33:50.5494071Z contiguous: bool, 2025-05-07T20:33:50.5494154Z compiled: bool, 2025-05-07T20:33:50.5494234Z ) -> None: 2025-05-07T20:33:50.5494327Z torch.manual_seed(2025) 2025-05-07T20:33:50.5494398Z 2025-05-07T20:33:50.5494729Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:50.5496690Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
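Every example above dies at the same point, the initial torch.randn allocation, with the allocator reporting only 26.44 MiB free out of 22.07 GiB. The error text itself suggests PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. A minimal sketch of applying that hint, assuming the suite is launched from Python; the variable is read when the CUDA caching allocator initializes, so setting it before importing torch is the safe pattern:

# Sketch: apply the allocator hint from the OOM messages above.
# PYTORCH_CUDA_ALLOC_CONF must be set before the first CUDA allocation.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # imported only after the env var is in place

# same shape as the failing T=4096, D=7168 example
x = torch.randn([4096, 2 * 7168], device="cuda", dtype=torch.bfloat16)

In a CI job the equivalent is exporting the variable in the step environment before the pytest invocation.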
Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
self =
T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1b1c9487c0>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError
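This CompilationError is architectural rather than transient: Triton rejects fp8e4nv outright on this GPU and lists only fp8e4b15 and fp8e5 as supported. A hedged sketch of a capability guard such a test could use, assuming (as is generally the case for Triton's fp8e4nv, i.e. float8_e4m3) that the dtype needs compute capability 8.9 or newer; the helper and class names are illustrative, not FBGEMM's actual gating:

# Sketch: skip fp8e4nv tests on GPUs that cannot compile them.
# Assumption: fp8e4nv requires SM 8.9+ (Ada/Hopper); verify against
# the Triton version actually deployed.
import unittest
import torch

def supports_fp8e4nv() -> bool:
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

@unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv needs SM 8.9+")
class SiluMulQuantTest(unittest.TestCase):  # hypothetical name
    ...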
Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=None, contiguous=False, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
    (test source identical to the T=128, D=5120 example above)
>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
    (remaining frames identical to the compilation failure above, ending in the same CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"))

/home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError
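Note how the allocator's headroom shrinks as the run progresses: the earlier failures reported 26.44 MiB free, the later ones only 4.44 MiB, and even a 20 MiB request now fails. That pattern is consistent with tensors from previous Hypothesis examples never being released between examples. A minimal sketch of an explicit release hook, assuming the test class can call it from setUp()/tearDown() or at the top of the test body:

# Sketch: free cached CUDA memory between Hypothesis examples.
# gc.collect() drops tensors that are no longer referenced;
# empty_cache() returns the freed blocks to the driver so the
# next example starts from a clean pool.
import gc
import torch

def release_cuda_memory() -> None:
    gc.collect()
    torch.cuda.synchronize()
    torch.cuda.empty_cache()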
Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=True, compiled=True)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. (remaining figures identical to the error above) See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError
=============================== warnings summary ===============================
../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108
../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108
../../../../../../../../miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108
  /home/ec2-user/miniconda/envs/build_binary/lib/python3.12/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details.
    warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. See "
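The three DeprecationWarnings come from Triton's autotuner: warmup, rep, and use_cuda_graph are no longer meaningful arguments (see the linked PR). Assuming the warning is triggered by passing them explicitly to triton.autotune, the fix is simply to stop passing them; a sketch with an illustrative kernel, not one of FBGEMM's:

# Sketch: an autotuned Triton kernel without the deprecated
# warmup=/rep=/use_cuda_graph= arguments.
import triton
import triton.language as tl

@triton.autotune(
    configs=[
        triton.Config({"BLOCK": 128}, num_warps=4),
        triton.Config({"BLOCK": 256}, num_warps=8),
    ],
    key=["n"],  # deprecated benchmarking knobs simply omitted
)
@triton.jit
def _copy_kernel(dst, src, n, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    tl.store(dst + offs, tl.load(src + offs, mask=mask), mask=mask)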
See " 2025-05-07T20:33:50.5547842Z 2025-05-07T20:33:50.5548061Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:33:50.5548232Z ================= 1 failed, 1 deselected, 3 warnings in 13.30s ================= 2025-05-07T20:33:52.2320817Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:33:52.2976542Z [EXEC] [ATTEMPT 2/2] Command attempt failed. 2025-05-07T20:33:52.2977024Z 2025-05-07T20:33:52.2977367Z [EXEC] The command has failed after 2 + 1 attempts; aborting. 2025-05-07T20:33:52.2978537Z [TEST] Python test suite FAILED for some or all tests despite multiple retries: ./moe/activation_test.py 2025-05-07T20:33:52.2979369Z 2025-05-07T20:33:52.2979378Z 2025-05-07T20:33:52.2979385Z 2025-05-07T20:33:52.2995155Z ##[error]Process completed with exit code 1. 2025-05-07T20:33:52.3079048Z Post job cleanup. 2025-05-07T20:33:52.4064281Z [command]/usr/bin/git version 2025-05-07T20:33:52.4109069Z git version 2.47.1 2025-05-07T20:33:52.4149356Z Copying '/home/ec2-user/.gitconfig' to '/home/ec2-user/actions-runner/_work/_temp/e2f51991-c98e-412c-96de-984594d25122/.gitconfig' 2025-05-07T20:33:52.4161768Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/e2f51991-c98e-412c-96de-984594d25122' before making global git config changes 2025-05-07T20:33:52.4162661Z Adding repository directory to the temporary git global config as a safe directory 2025-05-07T20:33:52.4167701Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM 2025-05-07T20:33:52.4213676Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand 2025-05-07T20:33:52.4249457Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :" 2025-05-07T20:33:52.4590945Z Entering 'external/asmjit' 2025-05-07T20:33:52.4658129Z Entering 'external/composable_kernel' 2025-05-07T20:33:52.4730538Z Entering 'external/cpuinfo' 2025-05-07T20:33:52.4796396Z Entering 'external/cutlass' 2025-05-07T20:33:52.4872026Z Entering 'external/googletest' 2025-05-07T20:33:52.4939139Z Entering 'external/hipify_torch' 2025-05-07T20:33:52.5006267Z Entering 'external/json' 2025-05-07T20:33:52.5093203Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader 2025-05-07T20:33:52.5115990Z http.https://github.com/.extraheader 2025-05-07T20:33:52.5126817Z [command]/usr/bin/git config --local --unset-all http.https://github.com/.extraheader 2025-05-07T20:33:52.5159277Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :" 2025-05-07T20:33:52.5493711Z Entering 'external/asmjit' 2025-05-07T20:33:52.5536369Z http.https://github.com/.extraheader 2025-05-07T20:33:52.5579167Z Entering 'external/composable_kernel' 2025-05-07T20:33:52.5622244Z http.https://github.com/.extraheader 2025-05-07T20:33:52.5673019Z Entering 'external/cpuinfo' 2025-05-07T20:33:52.5715701Z http.https://github.com/.extraheader 2025-05-07T20:33:52.5759318Z Entering 'external/cutlass' 2025-05-07T20:33:52.5808990Z http.https://github.com/.extraheader 2025-05-07T20:33:52.5861097Z 
Post job cleanup.
[command]/usr/bin/git version
git version 2.47.1
Copying '/home/ec2-user/.gitconfig' to '/home/ec2-user/actions-runner/_work/_temp/e2f51991-c98e-412c-96de-984594d25122/.gitconfig'
Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/e2f51991-c98e-412c-96de-984594d25122' before making global git config changes
Adding repository directory to the temporary git global config as a safe directory
[command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
[command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand
[command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :"
Entering 'external/asmjit'
Entering 'external/composable_kernel'
Entering 'external/cpuinfo'
Entering 'external/cutlass'
Entering 'external/googletest'
Entering 'external/hipify_torch'
Entering 'external/json'
[command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader
http.https://github.com/.extraheader
[command]/usr/bin/git config --local --unset-all http.https://github.com/.extraheader
[command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :"
Entering 'external/asmjit'
http.https://github.com/.extraheader
Entering 'external/composable_kernel'
http.https://github.com/.extraheader
Entering 'external/cpuinfo'
http.https://github.com/.extraheader
Entering 'external/cutlass'
http.https://github.com/.extraheader
Entering 'external/googletest'
http.https://github.com/.extraheader
Entering 'external/hipify_torch'
http.https://github.com/.extraheader
Entering 'external/json'
http.https://github.com/.extraheader
A job completed hook has been configured by the self-hosted runner administrator
##[group]Run '/home/ec2-user/runner-scripts/after_job.sh'
shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
##[endgroup]
[!ALERT!] Swap in detected! [!ALERT!]
[!ALERT!] Swap out detected [!ALERT!]
Cleaning up orphan processes